From 67ad62f7190cb9c0a380eeae4a15c5c3e5fd869e Mon Sep 17 00:00:00 2001 From: Vanessasaurus <814322+vsoch@users.noreply.github.com> Date: Sat, 23 Sep 2023 18:03:03 -0600 Subject: [PATCH] [wip] second design for metrics operator (#63) * WIP to refactor This is going to be a huge refactor to remove the application/storage "hard coded" legos replaced by a more flexible setup where we have one base metric set (no subtypes) and then metrics generate the replicated jobs (as many as they like, how they please) and then addons are provided to them, which can range from additional volumes to containers (that provide volumes) to any kind of customization. This is not ready for any kind of testing but I am mostly concerned about my computer blowing up and losing the work so I am saving for good measure :) Also, yay today! :D * definitely making bad life decisions * very satisfying deletion of things. * lammps ran! * amg is back * bdas is back * add back hpl we did not get this completely working before (likely the spack mpi install as a basic hostname does not work ) so a basic conversion is sufficient * add back kripke * laghos * test signing again * add back nekbone * add back pennant * add back quicksilver also simplify logic of applications - the launcher worker pattern is generic and can be shared * workflow format bug * add back fio * add back host volume example * add back ior * add back osu benchmarks! * add back chatterbug it is accepted this does not fully work, we need to come back to it. * add back netmark * systat and lammps working again * hpctoolkit design at least works but shared libraries are failing to load. HPCToolkit you are a jerk. I am laughing. And crying. And mostly crying. * clean up docs a little bit * addon documentation is good * hopefully fix bug * fixing workingdir bug! * update to v1alpha2 * bugfix * a single touch marker at the end of the copy is more reliable than a file that is part of it! 
* support to customize container for any metric, and for hpctoolkit to run post commands * support for custom container * add print at end of post analysis for hpctoolkit * fixing bug with internal crd state if we do not make a copy (reflect) of the interface, the state seems to change (and persist) between runs. While I am still worried about this design, this at least seems to fix that bug. I am also wondering about garbage collection (e.g., if making the copies means they stay around and the operator will use increasing memory) but that is still to be explored. Signed-off-by: vsoch --- .github/workflows/main.yaml | 28 +- .github/workflows/python.yaml | 2 +- .github/workflows/release.yaml | 2 +- Makefile | 3 +- PROJECT | 4 +- README.md | 1 + .../groupversion_info.go | 6 +- api/{v1alpha1 => v1alpha2}/metric_types.go | 227 +---- .../zz_generated.deepcopy.go | 128 ++- chart/templates/metricset-crd.yaml | 2 +- .../bases/flux-framework.org_metricsets.yaml | 212 ++--- config/samples/_v1alpha1_metricset.yaml | 12 - config/samples/kustomization.yaml | 4 - controllers/metric/configmap.go | 31 +- controllers/metric/metric.go | 82 +- controllers/metric/metric_controller.go | 58 +- controllers/metric/service.go | 2 +- controllers/metric/suite_test.go | 4 +- docs/_static/data/addons.html | 477 +++++++++++ docs/_static/data/addons.json | 37 + docs/_static/data/metrics.json | 25 - docs/_static/data/table.html | 14 +- docs/development/designs/current.md | 58 ++ .../{designs.md => designs/design1.md} | 0 .../img/application-metric-set.png | Bin .../img/application-metric-volume.png | Bin .../img/standalone-metric-set.png | Bin .../{ => designs}/img/storage-metric-set.png | Bin docs/development/designs/index.md | 10 + docs/development/developer-guide.md | 37 +- docs/development/index.md | 2 +- docs/development/metrics.md | 4 +- docs/getting_started/addons.md | 278 ++++++ .../custom-resource-definition.md | 278 +----- docs/getting_started/index.md | 1 + docs/getting_started/metrics.md | 137
+-- docs/getting_started/user-guide.md | 194 +---- docs/make.bat | 0 examples/dist/metrics-operator-arm.yaml | 188 +--- examples/dist/metrics-operator.yaml | 188 +--- examples/python/app-amg/metrics.json | 800 +----------------- examples/python/io-fio/metrics.yaml | 2 +- examples/python/network-netmark/metrics.yaml | 2 +- examples/python/perf-hello-world/metrics.yaml | 2 +- examples/python/perf-sysstat/metrics.yaml | 2 +- examples/tests/app-amg/metrics.yaml | 3 +- examples/tests/app-bdas/metrics.yaml | 2 +- examples/tests/app-hpl/metrics.yaml | 2 +- examples/tests/app-kripke/metrics.yaml | 2 +- examples/tests/app-laghos/metrics.yaml | 2 +- examples/tests/app-lammps/README.md | 2 +- examples/tests/app-lammps/metrics.yaml | 9 +- examples/tests/app-ldms/metrics.yaml | 2 +- examples/tests/app-nekbone/metrics.yaml | 2 +- examples/tests/app-pennant/metrics.yaml | 2 +- examples/tests/app-quicksilver/metrics.yaml | 2 +- examples/tests/io-fio/metrics.yaml | 18 +- examples/tests/io-fio/post-run.sh | 2 +- examples/tests/io-host-volume/metrics.yaml | 18 +- examples/tests/io-ior/metrics.yaml | 17 +- .../tests/network-chatterbug/metrics.yaml | 4 +- examples/tests/network-netmark/metrics.yaml | 3 +- .../tests/network-osu-benchmark/metrics.yaml | 5 +- examples/tests/perf-hello-world/metrics.yaml | 15 +- examples/tests/perf-hpctoolkit/metrics.yaml | 2 +- .../perf-lammps-hpctoolkit/metrics-rocky.yaml | 49 ++ .../tests/perf-lammps-hpctoolkit/metrics.yaml | 41 + examples/tests/perf-lammps/metrics.yaml | 13 +- hack/addons-gen/main.go | 53 ++ hack/{docs-gen => metrics-gen}/main.go | 5 +- main.go | 5 +- pkg/addons/addons.go | 132 +++ pkg/addons/containers.go | 192 +++++ pkg/addons/hpctoolkit.go | 408 +++++++++ pkg/addons/logs.go | 51 ++ pkg/addons/volumes.go | 393 +++++++++ pkg/jobs/application.go | 115 --- pkg/jobs/launcher.go | 328 ------- pkg/jobs/logs.go | 29 - pkg/jobs/storage.go | 93 -- pkg/metadata/metadata.go | 56 ++ pkg/metrics/app/amg.go | 96 +-- pkg/metrics/app/bdas.go | 127 
+-- pkg/metrics/app/hpl.go | 107 ++- pkg/metrics/app/kripke.go | 88 +- pkg/metrics/app/laghos.go | 88 +- pkg/metrics/app/lammps.go | 121 +-- pkg/metrics/app/ldms.go | 91 +- pkg/metrics/app/nekbone.go | 89 +- pkg/metrics/app/pennant.go | 91 +- pkg/metrics/app/quicksilver.go | 91 +- pkg/metrics/application.go | 137 +-- pkg/metrics/base.go | 201 +++++ pkg/metrics/containers.go | 149 +--- pkg/metrics/io/fio.go | 127 ++- pkg/metrics/io/ior.go | 79 +- pkg/metrics/io/sysstat.go | 86 +- pkg/metrics/jobset.go | 156 ++-- pkg/metrics/launcher.go | 291 +++++++ pkg/metrics/logs.go | 63 +- pkg/metrics/metrics.go | 81 +- pkg/metrics/metricset.go | 165 ---- pkg/metrics/network/chatterbug.go | 81 +- pkg/metrics/network/netmark.go | 103 ++- pkg/metrics/network/osu-benchmark.go | 91 +- pkg/metrics/perf/hpctoolkit.go | 190 ----- pkg/metrics/perf/sysstat.go | 116 +-- pkg/metrics/resources.go | 2 +- pkg/metrics/set.go | 146 ++++ pkg/metrics/standalone.go | 50 -- pkg/metrics/storage.go | 71 +- pkg/metrics/volumes.go | 205 ++--- pkg/specs/specs.go | 79 ++ script/test.sh | 0 sdk/python/{v1alpha1 => v1alpha2}/.gitignore | 0 .../{v1alpha1 => v1alpha2}/CHANGELOG.md | 0 sdk/python/{v1alpha1 => v1alpha2}/MANIFEST.in | 0 sdk/python/{v1alpha1 => v1alpha2}/README.md | 0 .../metricsoperator/__init__.py | 0 .../metricsoperator/client.py | 0 .../metricsoperator/metrics/__init__.py | 0 .../metricsoperator/metrics/app/__init__.py | 0 .../metricsoperator/metrics/app/amg.py | 0 .../metricsoperator/metrics/app/lammps.py | 0 .../metricsoperator/metrics/base.py | 0 .../metrics/network/__init__.py | 0 .../metrics/network/netmark.py | 0 .../metrics/network/osu_benchmark.py | 0 .../metricsoperator/metrics/perf.py | 0 .../metricsoperator/metrics/storage.py | 0 .../metricsoperator/utils.py | 0 .../{v1alpha1 => v1alpha2}/pyproject.toml | 0 sdk/python/{v1alpha1 => v1alpha2}/setup.py | 4 +- setup.cfg | 4 +- 134 files changed, 4604 insertions(+), 4683 deletions(-) rename api/{v1alpha1 => 
v1alpha2}/groupversion_info.go (90%) rename api/{v1alpha1 => v1alpha2}/metric_types.go (50%) rename api/{v1alpha1 => v1alpha2}/zz_generated.deepcopy.go (86%) delete mode 100644 config/samples/_v1alpha1_metricset.yaml delete mode 100644 config/samples/kustomization.yaml create mode 100644 docs/_static/data/addons.html create mode 100644 docs/_static/data/addons.json create mode 100644 docs/development/designs/current.md rename docs/development/{designs.md => designs/design1.md} (100%) rename docs/development/{ => designs}/img/application-metric-set.png (100%) rename docs/development/{ => designs}/img/application-metric-volume.png (100%) rename docs/development/{ => designs}/img/standalone-metric-set.png (100%) rename docs/development/{ => designs}/img/storage-metric-set.png (100%) create mode 100644 docs/development/designs/index.md create mode 100644 docs/getting_started/addons.md mode change 100644 => 100755 docs/make.bat create mode 100644 examples/tests/perf-lammps-hpctoolkit/metrics-rocky.yaml create mode 100644 examples/tests/perf-lammps-hpctoolkit/metrics.yaml create mode 100644 hack/addons-gen/main.go rename hack/{docs-gen => metrics-gen}/main.go (94%) create mode 100644 pkg/addons/addons.go create mode 100644 pkg/addons/containers.go create mode 100644 pkg/addons/hpctoolkit.go create mode 100644 pkg/addons/logs.go create mode 100644 pkg/addons/volumes.go delete mode 100644 pkg/jobs/application.go delete mode 100644 pkg/jobs/launcher.go delete mode 100644 pkg/jobs/logs.go delete mode 100644 pkg/jobs/storage.go create mode 100644 pkg/metadata/metadata.go create mode 100644 pkg/metrics/base.go create mode 100644 pkg/metrics/launcher.go delete mode 100644 pkg/metrics/metricset.go delete mode 100644 pkg/metrics/perf/hpctoolkit.go create mode 100644 pkg/metrics/set.go delete mode 100644 pkg/metrics/standalone.go create mode 100644 pkg/specs/specs.go mode change 100755 => 100644 script/test.sh rename sdk/python/{v1alpha1 => v1alpha2}/.gitignore (100%) rename 
sdk/python/{v1alpha1 => v1alpha2}/CHANGELOG.md (100%) rename sdk/python/{v1alpha1 => v1alpha2}/MANIFEST.in (100%) rename sdk/python/{v1alpha1 => v1alpha2}/README.md (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/__init__.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/client.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/__init__.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/app/__init__.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/app/amg.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/app/lammps.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/base.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/network/__init__.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/network/netmark.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/network/osu_benchmark.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/perf.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/metrics/storage.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/metricsoperator/utils.py (100%) rename sdk/python/{v1alpha1 => v1alpha2}/pyproject.toml (100%) rename sdk/python/{v1alpha1 => v1alpha2}/setup.py (97%) diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml index d1f2369..4292ec4 100644 --- a/.github/workflows/main.yaml +++ b/.github/workflows/main.yaml @@ -16,7 +16,7 @@ jobs: - name: Check Spelling uses: crate-ci/typos@7ad296c72fa8265059cc03d1eda562fbdfcd6df2 # v1.9.0 with: - files: ./README.md ./config/samples ./docs/*.md ./docs/*/*.md + files: ./README.md ./docs/*.md ./docs/*/*.md ./docs/*/*/*.md - name: Lint and format Python code run: | @@ -66,19 +66,19 @@ jobs: strategy: fail-fast: false matrix: - test: [["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # 
performance test - ["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # storage test - ["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], # storage test - ["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120], # storage test - # ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120], # network app test - ["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120], # standalone app test - # ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120], # standalone app test - ["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test - ["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], # standalone app test - ["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120], # standalone app test - ["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120], # standalone app test - ["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120], # standalone app test - ["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120]] # standalone app test + test: [["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120], + ["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60], + ["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], + ["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], + ["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120], + ## ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120], + ["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120], + ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120], + ["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], + ["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], + ["app-pennant", 
"ghcr.io/converged-computing/metric-pennant:latest", 120], + ["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120], + ["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120]] steps: - name: Clone the code diff --git a/.github/workflows/python.yaml b/.github/workflows/python.yaml index f11e9b9..6fb6a4e 100644 --- a/.github/workflows/python.yaml +++ b/.github/workflows/python.yaml @@ -27,7 +27,7 @@ jobs: run: | export PATH="/usr/share/miniconda/bin:$PATH" source activate mo - cd sdk/python/v1alpha1 + cd sdk/python/v1alpha2 pip install . pip install seaborn pandas diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml index 8b3d8a5..81cc2a4 100644 --- a/.github/workflows/release.yaml +++ b/.github/workflows/release.yaml @@ -106,7 +106,7 @@ jobs: run: | export PATH="/usr/share/miniconda/bin:$PATH" source activate mo - cd sdk/python/v1alpha1/ + cd sdk/python/v1alpha2/ pip install -e . python setup.py sdist bdist_wheel cd dist diff --git a/Makefile b/Makefile index 2e8f8e7..f8bd25a 100644 --- a/Makefile +++ b/Makefile @@ -323,7 +323,8 @@ helm: manifests kustomize helmify .PHONY: docs-data docs-data: - go run hack/docs-gen/main.go docs/_static/data/metrics.json + go run hack/metrics-gen/main.go docs/_static/data/metrics.json + go run hack/addons-gen/main.go docs/_static/data/addons.json .PHONY: pre-push pre-push: generate build-config-arm build-config docs-data diff --git a/PROJECT b/PROJECT index 6627170..f88ad5a 100644 --- a/PROJECT +++ b/PROJECT @@ -17,6 +17,6 @@ resources: controller: true domain: flux-framework.org kind: MetricSet - path: github.com/converged-computing/metrics-operator/api/v1alpha1 - version: v1alpha1 + path: github.com/converged-computing/metrics-operator/api/v1alpha2 + version: v1alpha2 version: "3" diff --git a/README.md b/README.md index 0f3d733..fbabff9 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,7 @@ To learn more: ## Dinosaur TODO +- Figure out issue with errors.IsNotFound 
not working... - We need a way for the entrypoint command to monitor (based on the container) to differ (potentially) - For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful) - For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application diff --git a/api/v1alpha1/groupversion_info.go b/api/v1alpha2/groupversion_info.go similarity index 90% rename from api/v1alpha1/groupversion_info.go rename to api/v1alpha2/groupversion_info.go index 2b367e2..0db23ee 100644 --- a/api/v1alpha1/groupversion_info.go +++ b/api/v1alpha2/groupversion_info.go @@ -14,10 +14,10 @@ See the License for the specific language governing permissions and limitations under the License. */ -// Package v1alpha1 contains API Schema definitions for the v1alpha1 API group +// Package v1alpha2 contains API Schema definitions for the v1alpha2 API group // +kubebuilder:object:generate=true // +groupName=flux-framework.org -package v1alpha1 +package v1alpha2 import ( "k8s.io/apimachinery/pkg/runtime/schema" @@ -26,7 +26,7 @@ import ( var ( // GroupVersion is group version used to register these objects - GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha1"} + GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha2"} // SchemeBuilder is used to add go types to the GroupVersionKind scheme SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion} diff --git a/api/v1alpha1/metric_types.go b/api/v1alpha2/metric_types.go similarity index 50% rename from api/v1alpha1/metric_types.go rename to api/v1alpha2/metric_types.go index 4dec943..64e9cb1 100644 --- a/api/v1alpha1/metric_types.go +++ b/api/v1alpha2/metric_types.go @@ -14,11 +14,10 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ -package v1alpha1 +package v1alpha2 import ( "fmt" - "reflect" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/util/intstr" @@ -51,19 +50,10 @@ type MetricSetSpec struct { // +optional DeadlineSeconds int64 `json:"deadlineSeconds,omitempty"` - // A storage setup that we want to measure performance for. - // and binding to storage metrics - // +optional - Storage Storage `json:"storage"` - // Pod spec for the application, standalone, or storage metrics //+optional Pod Pod `json:"pod"` - // For metrics that require an application, we need a container and name (for now) - // +optional - Application Application `json:"application"` - // Parallelism (e.g., pods) // +kubebuilder:default=1 // +default=1 @@ -78,11 +68,6 @@ type MetricSetSpec struct { // Right now we just include an interactive option //+optional Logging Logging `json:"logging"` - - // Single pod completion, meaning the jobspec completions is unset - // and we only require one main completion - // +optional - Completions int32 `json:"completions"` } type Logging struct { @@ -114,51 +99,34 @@ type ContainerSpec struct { } type SecurityContext struct { - Privileged bool `json:"privileged"` -} -// Storage that will be monitored, or storage alongside a standalone metric -type Storage struct { - - // Volume type to test (not all storage interfaces require one explicitly) //+optional - Volume Volume `json:"volume"` - - // Commands to run (pre is supported to make bind) - // +optional - Commands Commands `json:"commands"` -} - -// Application that will be monitored -type Application struct { - Image string `json:"image"` - - // command to execute and monitor (if consistent across pods) - Command string `json:"command"` + Privileged bool `json:"privileged"` - // Working Directory //+optional - WorkingDir string `json:"workingDir"` + AllowPtrace bool `json:"allowPtrace"` - // Entrypoint of container, if different from command //+optional - Entrypoint string `json:"entrypoint"` + AllowAdmin 
bool `json:"allowAdmin"` +} - // A pull secret for the application container - //+optional - PullSecret string `json:"pullSecret"` +// A Metric addon is an interface that exposes extra volumes for a metric. Examples include: +// A storage volume to be mounted on one or more of the replicated jobs +// A single application container. +type MetricAddon struct { + Name string `json:"name"` - // Resources include limits and requests for the application + // Metric Addon Options // +optional - Resources ContainerResources `json:"resources"` + Options map[string]intstr.IntOrString `json:"options"` - // Container Spec has attributes for the container - //+optional - Attributes ContainerSpec `json:"attributes"` + // Addon List Options + // +optional + ListOptions map[string][]intstr.IntOrString `json:"listOptions"` - // Existing Volumes for the application + // Addon Map Options // +optional - Volumes map[string]Volume `json:"volumes"` + MapOptions map[string]map[string]intstr.IntOrString `json:"mapOptions"` } // ContainerResources include limits and requests @@ -188,51 +156,16 @@ type Commands struct { type ContainerResource map[string]intstr.IntOrString -// A Volume should correspond with an existing volume, either: -// config map, secret, or claim name. 
-type Volume struct { - - // Path and claim name are always required if a secret isn't defined - // +optional - Path string `json:"path,omitempty"` - - // Hostpath volume on the host to bind to path - // +optional - HostPath string `json:"hostPath"` - - // Config map name if the existing volume is a config map - // You should also define items if you are using this - // +optional - ConfigMapName string `json:"configMapName,omitempty"` - - // Items (key and paths) for the config map - // +optional - Items map[string]string `json:"items"` - - // Claim name if the existing volume is a PVC - // +optional - ClaimName string `json:"claimName,omitempty"` - - // An existing secret - // +optional - SecretName string `json:"secretName,omitempty"` - - // EmptyVol if true generates an empty volume at the path - // +kubebuilder:default=false - // +default=false - // +optional - EmptyVol bool `json:"emptyVol,omitempty"` - - // +kubebuilder:default=false - // +default=false - // +optional - ReadOnly bool `json:"readOnly,omitempty"` -} - // The difference between benchmark and metric is subtle. // A metric is more a measurment, and the benchmark is the comparison value. // I don't have strong opinions but I think we are doing more measurment // not necessarily with benchmarks + +// A metric is basically a container. 
It minimally provides: +// In the simplest case, a sidecar container (e.g., service or similar) +// Optionally: Possibly additional volumes that can be mounted in +// With a shared process namespace, ability to monitor + type Metric struct { Name string `json:"name"` @@ -241,6 +174,16 @@ type Metric struct { // +optional Options map[string]intstr.IntOrString `json:"options"` + // Use a custom container image (advanced users only) + // +optional + Image string `json:"image,omitempty"` + + // A Metric addon can be storage (volume) or an application, + // It's an additional entity that can customize a replicated job, + // either adding assets / features or entire containers to the pod + //+optional + Addons []MetricAddon `json:"addons"` + // Metric List Options // Metric specific options // +optional @@ -287,36 +230,9 @@ type MetricSet struct { Status MetricSetStatus `json:"status,omitempty"` } -// Determine if an application or storage is present, or standalone -func (m *MetricSet) HasApplication() bool { - return !reflect.DeepEqual(m.Spec.Application, Application{}) -} -func (m *MetricSet) HasStorage() bool { - return !reflect.DeepEqual(m.Spec.Storage, Storage{}) -} -func (m *MetricSet) HasStorageVolume() bool { - return !reflect.DeepEqual(m.Spec.Storage.Volume, Volume{}) -} -func (m *MetricSet) IsStandalone() bool { - return !m.HasStorage() && !m.HasApplication() -} - // Validate a requested metricset func (m *MetricSet) Validate() bool { - // An application or storage setup is required - if !m.HasApplication() && !m.HasStorage() && !m.IsStandalone() { - fmt.Printf("😥️ An application OR storage OR standalone entry is required.\n") - return false - } - - // We don't currently support running both at once - // (but should be fine to allow extra standalone) - if m.HasApplication() && m.HasStorage() { - fmt.Printf("😥️ An application OR storage entry is required, not both.\n") - return false - } - if len(m.Spec.Metrics) == 0 { fmt.Printf("😥️ One or more metrics are 
required.\n") return false @@ -325,86 +241,9 @@ func (m *MetricSet) Validate() bool { fmt.Printf("😥️ Pods must be >= 1.") return false } - - // Validation for application - if m.HasApplication() { - if m.Spec.Application.Command == "" { - fmt.Printf("😥️ Application is missing a command.") - return false - } - - if m.Spec.Application.Entrypoint == "" { - m.Spec.Application.Entrypoint = m.Spec.Application.Command - } - - // For existing volumes, if it's a claim, a path is required. - if !m.validateVolumes(m.Spec.Application.Volumes) { - fmt.Printf("😥️ Application container volumes are not valid\n") - return false - } - } - - // Validate for storage - if m.HasStorage() && !m.validateVolumes(map[string]Volume{"storage": m.Spec.Storage.Volume}) { - fmt.Printf("😥️ Storage volumes are not valid\n") - return false - } - - // If completions unset, set to parallelism - if m.Spec.Completions == 0 { - m.Spec.Completions = m.Spec.Pods - } - - // A standalone metric by definition runs alone - if len(m.Spec.Metrics) > 1 && m.IsStandalone() { - fmt.Printf("😥️ A standalone metric by definition runs on its own\n") - return false - } - return true } -// validateExistingVolumes ensures secret names vs. 
volume paths are valid -func (m *MetricSet) validateVolumes(volumes map[string]Volume) bool { - - valid := true - for key, volume := range volumes { - - // Case 1: it's a secret and we only need that - if volume.SecretName != "" { - continue - } - - // Case 2: it's a config map (and will have items too, but we don't hard require them) - if volume.ConfigMapName != "" { - continue - } - - // Case 3: Hostpath volume (mostly for testing) - if volume.HostPath != "" { - continue - } - - // Case 4: claim desired without path - if volume.ClaimName == "" && volume.Path != "" { - fmt.Printf("😥️ Found existing volume %s with path %s that is missing a claim name\n", key, volume.Path) - valid = false - } - // Case 5: reverse of the above - if volume.ClaimName != "" && volume.Path == "" { - fmt.Printf("😥️ Found existing volume %s with claimName %s that is missing a path\n", key, volume.ClaimName) - valid = false - } - - // Case 6: empty volume needs path - if volume.EmptyVol && volume.Path == "" { - fmt.Printf("😥️ Found empty volume %s that is missing a path\n", key) - valid = false - } - } - return valid -} - //+kubebuilder:object:root=true // MetricSetList contains a list of MetricSet diff --git a/api/v1alpha1/zz_generated.deepcopy.go b/api/v1alpha2/zz_generated.deepcopy.go similarity index 86% rename from api/v1alpha1/zz_generated.deepcopy.go rename to api/v1alpha2/zz_generated.deepcopy.go index 4d81598..cb43d2a 100644 --- a/api/v1alpha1/zz_generated.deepcopy.go +++ b/api/v1alpha2/zz_generated.deepcopy.go @@ -19,37 +19,13 @@ limitations under the License. // Code generated by controller-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( runtime "k8s.io/apimachinery/pkg/runtime" "k8s.io/apimachinery/pkg/util/intstr" ) -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
-func (in *Application) DeepCopyInto(out *Application) { - *out = *in - in.Resources.DeepCopyInto(&out.Resources) - out.Attributes = in.Attributes - if in.Volumes != nil { - in, out := &in.Volumes, &out.Volumes - *out = make(map[string]Volume, len(*in)) - for key, val := range *in { - (*out)[key] = *val.DeepCopy() - } - } -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new Application. -func (in *Application) DeepCopy() *Application { - if in == nil { - return nil - } - out := new(Application) - in.DeepCopyInto(out) - return out -} - // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *Commands) DeepCopyInto(out *Commands) { *out = *in @@ -156,6 +132,13 @@ func (in *Metric) DeepCopyInto(out *Metric) { (*out)[key] = val } } + if in.Addons != nil { + in, out := &in.Addons, &out.Addons + *out = make([]MetricAddon, len(*in)) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } + } if in.ListOptions != nil { in, out := &in.ListOptions, &out.ListOptions *out = make(map[string][]intstr.IntOrString, len(*in)) @@ -202,6 +185,60 @@ func (in *Metric) DeepCopy() *Metric { return out } +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
+func (in *MetricAddon) DeepCopyInto(out *MetricAddon) { + *out = *in + if in.Options != nil { + in, out := &in.Options, &out.Options + *out = make(map[string]intstr.IntOrString, len(*in)) + for key, val := range *in { + (*out)[key] = val + } + } + if in.ListOptions != nil { + in, out := &in.ListOptions, &out.ListOptions + *out = make(map[string][]intstr.IntOrString, len(*in)) + for key, val := range *in { + var outVal []intstr.IntOrString + if val == nil { + (*out)[key] = nil + } else { + in, out := &val, &outVal + *out = make([]intstr.IntOrString, len(*in)) + copy(*out, *in) + } + (*out)[key] = outVal + } + } + if in.MapOptions != nil { + in, out := &in.MapOptions, &out.MapOptions + *out = make(map[string]map[string]intstr.IntOrString, len(*in)) + for key, val := range *in { + var outVal map[string]intstr.IntOrString + if val == nil { + (*out)[key] = nil + } else { + in, out := &val, &outVal + *out = make(map[string]intstr.IntOrString, len(*in)) + for key, val := range *in { + (*out)[key] = val + } + } + (*out)[key] = outVal + } + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new MetricAddon. +func (in *MetricAddon) DeepCopy() *MetricAddon { + if in == nil { + return nil + } + out := new(MetricAddon) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
func (in *MetricSet) DeepCopyInto(out *MetricSet) { *out = *in @@ -271,9 +308,7 @@ func (in *MetricSetSpec) DeepCopyInto(out *MetricSetSpec) { (*in)[i].DeepCopyInto(&(*out)[i]) } } - in.Storage.DeepCopyInto(&out.Storage) in.Pod.DeepCopyInto(&out.Pod) - in.Application.DeepCopyInto(&out.Application) if in.Resources != nil { in, out := &in.Resources, &out.Resources *out = make(ContainerResource, len(*in)) @@ -345,42 +380,3 @@ func (in *SecurityContext) DeepCopy() *SecurityContext { in.DeepCopyInto(out) return out } - -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *Storage) DeepCopyInto(out *Storage) { - *out = *in - in.Volume.DeepCopyInto(&out.Volume) - out.Commands = in.Commands -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new Storage. -func (in *Storage) DeepCopy() *Storage { - if in == nil { - return nil - } - out := new(Storage) - in.DeepCopyInto(out) - return out -} - -// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. -func (in *Volume) DeepCopyInto(out *Volume) { - *out = *in - if in.Items != nil { - in, out := &in.Items, &out.Items - *out = make(map[string]string, len(*in)) - for key, val := range *in { - (*out)[key] = val - } - } -} - -// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new Volume. 
-func (in *Volume) DeepCopy() *Volume { - if in == nil { - return nil - } - out := new(Volume) - in.DeepCopyInto(out) - return out -} diff --git a/chart/templates/metricset-crd.yaml b/chart/templates/metricset-crd.yaml index 3a3a090..9dfffe2 100644 --- a/chart/templates/metricset-crd.yaml +++ b/chart/templates/metricset-crd.yaml @@ -15,7 +15,7 @@ spec: singular: metricset scope: Namespaced versions: - - name: v1alpha1 + - name: v1alpha2 schema: openAPIV3Schema: description: MetricSet is the Schema for the metrics API diff --git a/config/crd/bases/flux-framework.org_metricsets.yaml b/config/crd/bases/flux-framework.org_metricsets.yaml index f4249d2..ad58e31 100644 --- a/config/crd/bases/flux-framework.org_metricsets.yaml +++ b/config/crd/bases/flux-framework.org_metricsets.yaml @@ -15,7 +15,7 @@ spec: singular: metricset scope: Namespaced versions: - - name: v1alpha1 + - name: v1alpha2 schema: openAPIV3Schema: description: MetricSet is the Schema for the metrics API @@ -35,103 +35,6 @@ spec: spec: description: MetricSpec defines the desired state of Metric properties: - application: - description: For metrics that require an application, we need a container - and name (for now) - properties: - attributes: - description: Container Spec has attributes for the container - properties: - securityContext: - description: Security context for the pod - properties: - privileged: - type: boolean - required: - - privileged - type: object - type: object - command: - description: command to execute and monitor (if consistent across - pods) - type: string - entrypoint: - description: Entrypoint of container, if different from command - type: string - image: - type: string - pullSecret: - description: A pull secret for the application container - type: string - resources: - description: Resources include limits and requests for the application - properties: - limits: - additionalProperties: - anyOf: - - type: integer - - type: string - x-kubernetes-int-or-string: true - type: 
object - requests: - additionalProperties: - anyOf: - - type: integer - - type: string - x-kubernetes-int-or-string: true - type: object - type: object - volumes: - additionalProperties: - description: 'A Volume should correspond with an existing volume, - either: config map, secret, or claim name.' - properties: - claimName: - description: Claim name if the existing volume is a PVC - type: string - configMapName: - description: Config map name if the existing volume is a - config map You should also define items if you are using - this - type: string - emptyVol: - default: false - description: EmptyVol if true generates an empty volume - at the path - type: boolean - hostPath: - description: Hostpath volume on the host to bind to path - type: string - items: - additionalProperties: - type: string - description: Items (key and paths) for the config map - type: object - path: - description: Path and claim name are always required if - a secret isn't defined - type: string - readOnly: - default: false - type: boolean - secretName: - description: An existing secret - type: string - type: object - description: Existing Volumes for the application - type: object - workingDir: - description: Working Directory - type: string - required: - - command - - image - type: object - completions: - description: Single pod completion, meaning the jobspec completions - is unset and we only require one main completion - format: int32 - type: integer deadlineSeconds: default: 31500000 description: Should the job be limited to a particular number of seconds? @@ -155,23 +58,69 @@ spec: description: The name of the metric (that will be associated with a flavor like storage) items: - description: The difference between benchmark and metric is subtle. - A metric is more a measurment, and the benchmark is the comparison - value. 
I don't have strong opinions but I think we are doing more - measurment not necessarily with benchmarks properties: + addons: + description: A Metric addon can be storage (volume) or an application, + It's an additional entity that can customize a replicated + job, either adding assets / features or entire containers + to the pod + items: + description: 'A Metric addon is an interface that exposes + extra volumes for a metric. Examples include: A storage + volume to be mounted on one or more of the replicated jobs + A single application container.' + properties: + listOptions: + additionalProperties: + items: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + type: array + description: Addon List Options + type: object + mapOptions: + additionalProperties: + additionalProperties: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + type: object + description: Addon Map Options + type: object + name: + type: string + options: + additionalProperties: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + description: Metric Addon Options + type: object + required: + - name + type: object + type: array attributes: description: Container Spec has attributes for the container properties: securityContext: description: Security context for the pod properties: + allowAdmin: + type: boolean + allowPtrace: + type: boolean privileged: type: boolean - required: - - privileged type: object type: object + image: + description: Use a custom container image (advanced users only) + type: string listOptions: additionalProperties: items: @@ -256,61 +205,6 @@ spec: default: ms description: Service name for the JobSet (MetricsSet) cluster network type: string - storage: - description: A storage setup that we want to measure performance for. 
- and binding to storage metrics - properties: - commands: - description: Commands to run (pre is supported to make bind) - properties: - post: - description: post happens at end (after collection end) - type: string - pre: - description: pre command happens at start (before anything - else) - type: string - prefix: - description: Command prefix to put in front of a metric main - command (not applicable for all) - type: string - type: object - volume: - description: Volume type to test (not all storage interfaces require - one explicitly) - properties: - claimName: - description: Claim name if the existing volume is a PVC - type: string - configMapName: - description: Config map name if the existing volume is a config - map You should also define items if you are using this - type: string - emptyVol: - default: false - description: EmptyVol if true generates an empty volume at - the path - type: boolean - hostPath: - description: Hostpath volume on the host to bind to path - type: string - items: - additionalProperties: - type: string - description: Items (key and paths) for the config map - type: object - path: - description: Path and claim name are always required if a - secret isn't defined - type: string - readOnly: - default: false - type: boolean - secretName: - description: An existing secret - type: string - type: object - type: object type: object status: description: MetricStatus defines the observed state of Metric diff --git a/config/samples/_v1alpha1_metricset.yaml b/config/samples/_v1alpha1_metricset.yaml deleted file mode 100644 index 1ead022..0000000 --- a/config/samples/_v1alpha1_metricset.yaml +++ /dev/null @@ -1,12 +0,0 @@ -apiVersion: flux-framework.org/v1alpha1 -kind: MetricSet -metadata: - labels: - app.kubernetes.io/name: metricset - app.kubernetes.io/instance: metricset-sample - app.kubernetes.io/part-of: test - app.kubernetes.io/managed-by: kustomize - app.kubernetes.io/created-by: test - name: metricset-sample -spec: - # TODO(user): Add 
fields here diff --git a/config/samples/kustomization.yaml b/config/samples/kustomization.yaml deleted file mode 100644 index ea8f23f..0000000 --- a/config/samples/kustomization.yaml +++ /dev/null @@ -1,4 +0,0 @@ -## Append samples you want in your CSV to this file as resources ## -resources: -- _v1alpha1_metricset.yaml -#+kubebuilder:scaffold:manifestskustomizesamples diff --git a/controllers/metric/configmap.go b/controllers/metric/configmap.go index aa23394..4f5d439 100644 --- a/controllers/metric/configmap.go +++ b/controllers/metric/configmap.go @@ -11,8 +11,9 @@ import ( "context" "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" mctrl "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" @@ -20,11 +21,13 @@ import ( ctrl "sigs.k8s.io/controller-runtime" ) +// TODO this should take the final entrypoint scripts // ensureConfigMap ensures we've generated the read only entrypoints func (r *MetricSetReconciler) ensureConfigMaps( ctx context.Context, - set *api.MetricSet, - sets *map[string]mctrl.MetricSet, + spec *api.MetricSet, + set *mctrl.MetricSet, + containerSpecs []*specs.ContainerSpec, ) (*corev1.ConfigMap, ctrl.Result, error) { // Look for the config map by name @@ -32,8 +35,8 @@ func (r *MetricSetReconciler) ensureConfigMaps( err := r.Get( ctx, types.NamespacedName{ - Name: set.Name, - Namespace: set.Namespace, + Name: spec.Name, + Namespace: spec.Namespace, }, existing, ) @@ -45,18 +48,14 @@ func (r *MetricSetReconciler) ensureConfigMaps( // Prepare lookup of entrypoints, one per application/storage, // or possible multiple for a standalone metric data := map[string]string{} - count := 0 - for _, s := range *sets { - for _, es := range s.EntrypointScripts(set) { - key := es.Name - if key == "" { - key = 
fmt.Sprintf("entrypoint-%d", count) - } - data[key] = es.Script - } - count += 1 + + // Go through each container spec entrypoint + for _, cs := range containerSpecs { + r.Log.Info("⬜️ ConfigMaps", "Name", cs.EntrypointScript.Name, "Writing", cs) + data[cs.EntrypointScript.Name] = cs.EntrypointScript.WriteScript() } - cm, result, err := r.getConfigMap(ctx, set, data) + + cm, result, err := r.getConfigMap(ctx, spec, data) if err != nil { r.Log.Error( err, "🟥️ Failed to get config map", diff --git a/controllers/metric/metric.go b/controllers/metric/metric.go index 78fcd88..361271d 100644 --- a/controllers/metric/metric.go +++ b/controllers/metric/metric.go @@ -10,8 +10,9 @@ package controllers import ( "context" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" mctrl "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" "k8s.io/apimachinery/pkg/types" ctrl "sigs.k8s.io/controller-runtime" jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" @@ -21,19 +22,32 @@ import ( func (r *MetricSetReconciler) ensureMetricSet( ctx context.Context, spec *api.MetricSet, - sets *map[string]mctrl.MetricSet, + set *mctrl.MetricSet, ) (ctrl.Result, error) { - // First ensure config maps, typically entrypoints for custom metrics containers. - // They are all bound to the same config map (read only volume) - // and named by metric index or custom metric script key name - // We could theoretically allow creating more than one JobSet here - // and change the name to include the group type. - _, result, err := r.ensureConfigMaps(ctx, spec, sets) + // Ensure we create the JobSet for the MetricSet + // We get back container specs to use for generating configmaps + // This doesn't actually create the jobset + js, cs, result, exists, err := r.getJobSet(ctx, spec, set) + if err != nil { + return result, err + } + + // Now create config maps... 
+ // The config maps need to exist before the jobsets, etc. + _, result, err = r.ensureConfigMaps(ctx, spec, set, cs) if err != nil { return result, err } + // And finally, the jobset + if !exists { + err = r.createJobSet(ctx, spec, js) + if err != nil { + return ctrl.Result{}, err + } + } + // Create headless service for the metrics set (which is a JobSet) // If we create > 1 JobSet, this should be updated selector := map[string]string{"metricset-name": spec.Name} @@ -42,13 +56,6 @@ func (r *MetricSetReconciler) ensureMetricSet( return result, err } - // Ensure we create the JobSet for the MetricSet - // either application, storage, or standalone based - // This could be updated to support > 1 - _, result, err = r.ensureJobSet(ctx, spec, sets) - if err != nil { - return result, err - } return ctrl.Result{}, nil } @@ -70,49 +77,40 @@ func (r *MetricSetReconciler) getExistingJob( return existing, err } -// getCluster does an actual check if we have a jobset in the namespace -func (r *MetricSetReconciler) ensureJobSet( +// getJobset retrieves the existing jobset (or generates the spec for a new one) +func (r *MetricSetReconciler) getJobSet( ctx context.Context, spec *api.MetricSet, - sets *map[string]mctrl.MetricSet, -) ([]*jobset.JobSet, ctrl.Result, error) { + set *mctrl.MetricSet, +) (*jobset.JobSet, []*specs.ContainerSpec, ctrl.Result, bool, error) { // Look for an existing job - // We only care about the set Name/Namespace matched to one - // This can eventually update to support > 1 if needed - existing, err := r.getExistingJob(ctx, spec) - jobsets := []*jobset.JobSet{existing} + js, err := r.getExistingJob(ctx, spec) + cs := []*specs.ContainerSpec{} // Create a new job if it does not exist if err != nil { + // TODO test checking for is not found error r.Log.Info( "✨ Creating a new Metrics JobSet ✨", "Namespace:", spec.Namespace, "Name:", spec.Name, ) - // Get one JobSet to create (can eventually support > 1) - jobsets, err := mctrl.GetJobSet(spec, sets) - if 
err != nil { - return jobsets, ctrl.Result{}, err - } - for _, js := range jobsets { - err = r.createJobSet(ctx, spec, js) - if err != nil { - return jobsets, ctrl.Result{}, err - } - } - return jobsets, ctrl.Result{}, err + // Get one JobSet and container specs to create config maps + js, cs, err := mctrl.GetJobSet(spec, set) + + // We don't create it here, we need configmaps first + return js, cs, ctrl.Result{}, false, err - } else { - r.Log.Info( - "🎉 Found existing Metrics JobSet 🎉", - "Namespace:", existing.Namespace, - "Name:", existing.Name, - ) } - return jobsets, ctrl.Result{}, err + r.Log.Info( + "🎉 Found existing Metrics JobSet 🎉", + "Namespace:", js.Namespace, + "Name:", js.Name, + ) + return js, cs, ctrl.Result{}, true, err } // createJobSet handles the creation operator diff --git a/controllers/metric/metric_controller.go b/controllers/metric/metric_controller.go index 744e71c..4785302 100644 --- a/controllers/metric/metric_controller.go +++ b/controllers/metric/metric_controller.go @@ -22,7 +22,7 @@ import ( "sigs.k8s.io/controller-runtime/pkg/log" jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" mctrl "github.com/converged-computing/metrics-operator/pkg/metrics" "github.com/go-logr/logr" ) @@ -104,65 +104,33 @@ func (r *MetricSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, nil } - // Verify that all metrics are valid. - // If the metric requires an application, the MetricSet CRD must have one! - // If the metric requires storage, the MetricSet CRD must defined storage - // Only one of application, storage, and standalone is required until - // we see a use case that warrants this be done differently. 
- metrics := []mctrl.Metric{} - - // We are allowed to create more that one MetricSet (JobSet) - sets := map[string]mctrl.MetricSet{} + // A MetricSet creates one or more JobSets (right now we just do 1) + set := mctrl.MetricSet{} for _, metric := range spec.Spec.Metrics { - // Get the individual metric, the type will determine the set we add it to + // Get the individual metric + r.Log.Info(fmt.Sprintf("🟦️ Looking for metric %s\n", metric.Name)) m, err := mctrl.GetMetric(&metric, &spec) if err != nil { r.Log.Error(err, fmt.Sprintf("🟥️ We had an issue loading that metric %s!", metric.Name)) return ctrl.Result{}, nil } - metricType := m.Type() - - // Determine if we've seen the MetricSet type yet, and add either way. - _, ok := sets[metricType] - if !ok { - ms, err := mctrl.GetMetricSet(metricType) - if err != nil { - r.Log.Info(fmt.Sprintf("🟥️ We cannot find a metricset type called %s!", metricType)) - return ctrl.Result{}, nil - } - sets[metricType] = ms - } - sets[metricType].Add(&m) - } - - // Ensure sets all have one or more metrics - for setName, set := range sets { - count := len(set.Metrics()) - if count == 0 { - r.Log.Info(fmt.Sprintf("🟥️ Metric set %s does not have any validated metrics.", setName)) - return ctrl.Result{}, nil - } - r.Log.Info(fmt.Sprintf("🟦️ Metric set %s has %d metrics.", setName, count)) - } - // Currently just support one JobSet per MetricSet - if len(sets) != 1 { - r.Log.Info(fmt.Sprintf("🟥️ Found %d metric sets, but exactly one is allowed to correspond to a final JobSet.", len(sets))) - return ctrl.Result{}, nil + // Add the metric to the set + set.Add(&m) } - // Currently just support one jobset for standalone - _, ok := sets[mctrl.StandaloneMetric] - if ok && len(metrics) > 1 { - r.Log.Info("🟥️ The standalone type metric, by definition, must be measured on its own and not with other metrics.") + // Ensure we have one or more metrics + count := len(set.Metrics()) + if count == 0 { + r.Log.Info(fmt.Sprintf("🟥️ Metric set %s in 
namespace %s does not have any validated metrics.", spec.Name, spec.Namespace)) return ctrl.Result{}, nil } + r.Log.Info(fmt.Sprintf("🟦️ Metric set %s in namespace %s has %d metrics.", spec.Name, spec.Namespace, count)) // Ensure the metricset is mapped to a JobSet. For design: // 1. If an application is provided, we pair the application at some scale with each metric as a contaienr // 2. If storage is provided, we create the volumes for the metric containers - // 3. If standalone is required, we create a JobSet with custom logic - result, err := r.ensureMetricSet(ctx, &spec, &sets) + result, err := r.ensureMetricSet(ctx, &spec, &set) if err != nil { r.Log.Error(err, "🟥️ Issue ensuring metric set") return result, err diff --git a/controllers/metric/service.go b/controllers/metric/service.go index df4792b..952c71b 100644 --- a/controllers/metric/service.go +++ b/controllers/metric/service.go @@ -17,7 +17,7 @@ import ( "k8s.io/apimachinery/pkg/api/errors" "k8s.io/apimachinery/pkg/types" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" ) // exposeService will expose services for job networking (headless) diff --git a/controllers/metric/suite_test.go b/controllers/metric/suite_test.go index eca7844..0faf94e 100644 --- a/controllers/metric/suite_test.go +++ b/controllers/metric/suite_test.go @@ -30,7 +30,7 @@ import ( logf "sigs.k8s.io/controller-runtime/pkg/log" "sigs.k8s.io/controller-runtime/pkg/log/zap" - fluxframeworkorgv1alpha1 "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" //+kubebuilder:scaffold:imports ) @@ -62,7 +62,7 @@ var _ = BeforeSuite(func() { Expect(err).NotTo(HaveOccurred()) Expect(cfg).NotTo(BeNil()) - err = fluxframeworkorgv1alpha1.AddToScheme(scheme.Scheme) + err = api.AddToScheme(scheme.Scheme) Expect(err).NotTo(HaveOccurred()) //+kubebuilder:scaffold:scheme diff --git 
a/docs/_static/data/addons.html b/docs/_static/data/addons.html new file mode 100644 index 0000000..9202b95 --- /dev/null +++ b/docs/_static/data/addons.html @@ -0,0 +1,477 @@ + + + + + + + + + + + + Metrics Operator -- Metrics + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + +
Name | Family | Description
+
+ + + + + + + + + diff --git a/docs/_static/data/addons.json b/docs/_static/data/addons.json new file mode 100644 index 0000000..ab039ec --- /dev/null +++ b/docs/_static/data/addons.json @@ -0,0 +1,37 @@ +[ + { + "name": "application", + "description": "basic application (container) type", + "family": "application" + }, + { + "name": "perf-hpctoolkit", + "description": "performance tools for measurement and analysis", + "family": "performance" + }, + { + "name": "volume-cm", + "description": "config map volume type", + "family": "volume" + }, + { + "name": "volume-empty", + "description": "empty volume type", + "family": "volume" + }, + { + "name": "volume-hostpath", + "description": "host path volume type", + "family": "volume" + }, + { + "name": "volume-pvc", + "description": "persistent volume claim volume type", + "family": "volume" + }, + { + "name": "volume-secret", + "description": "secret volume type", + "family": "volume" + } +] \ No newline at end of file diff --git a/docs/_static/data/metrics.json b/docs/_static/data/metrics.json index 05cfcf4..b5250ba 100644 --- a/docs/_static/data/metrics.json +++ b/docs/_static/data/metrics.json @@ -3,7 +3,6 @@ "name": "app-amg", "description": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", "family": "solver", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-amg:latest", "url": "https://github.com/LLNL/AMG" }, @@ -11,7 +10,6 @@ "name": "app-bdas", "description": "The big data analytic suite contains the K-Means observation label, PCA, and SVM benchmarks.", "family": "machine-learning", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-bdas:latest", "url": "https://asc.llnl.gov/sites/asc/files/2020-09/BDAS_Summary_b4bcf27_0.pdf" }, @@ -19,7 +17,6 @@ "name": "app-hpl", "description": "High-Performance Linpack (HPL)", "family": "solver", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-hpl-spack:latest", 
"url": "https://www.netlib.org/benchmark/hpl/" }, @@ -27,7 +24,6 @@ "name": "app-kripke", "description": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", "family": "solver", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-kripke:latest", "url": "https://github.com/LLNL/Kripke" }, @@ -35,7 +31,6 @@ "name": "app-laghos", "description": "LAGrangian High-Order Solver", "family": "solver", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-laghos:latest", "url": "https://github.com/CEED/Laghos" }, @@ -43,7 +38,6 @@ "name": "app-lammps", "description": "LAMMPS molecular dynamic simulation", "family": "simulation", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-lammps:latest", "url": "https://www.lammps.org/" }, @@ -51,7 +45,6 @@ "name": "app-ldms", "description": "provides LDMS, a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system.", "family": "performance", - "type": "application", "image": "ghcr.io/converged-computing/metric-ovis-hpc:latest", "url": "https://github.com/ovis-hpc/ovis" }, @@ -59,7 +52,6 @@ "name": "app-nekbone", "description": "A mini-app derived from the Nek5000 CFD code which is a high order, incompressible Navier-Stokes CFD solver based on the spectral element method. 
The conjugate gradiant solve is compute intense, contains small messages and frequent allreduces.", "family": "solver", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-nekbone:latest", "url": "https://github.com/Nek5000/Nekbone" }, @@ -67,7 +59,6 @@ "name": "app-pennant", "description": "Unstructured mesh hydrodynamics for advanced architectures ", "family": "simulation", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-pennant:latest", "url": "https://github.com/LLNL/pennant" }, @@ -75,7 +66,6 @@ "name": "app-quicksilver", "description": "A proxy app for the Monte Carlo Transport Code", "family": "simulation", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-quicksilver:latest", "url": "https://github.com/LLNL/Quicksilver" }, @@ -83,7 +73,6 @@ "name": "io-fio", "description": "Flexible IO Tester (FIO)", "family": "storage", - "type": "storage", "image": "ghcr.io/converged-computing/metric-fio:latest", "url": "https://fio.readthedocs.io/en/latest/fio_doc.html" }, @@ -91,7 +80,6 @@ "name": "io-ior", "description": "HPC IO Benchmark", "family": "storage", - "type": "storage", "image": "ghcr.io/converged-computing/metric-ior:latest", "url": "https://github.com/hpc/ior" }, @@ -99,7 +87,6 @@ "name": "io-sysstat", "description": "statistics for Linux tasks (processes) : I/O, CPU, memory, etc.", "family": "storage", - "type": "storage", "image": "ghcr.io/converged-computing/metric-sysstat:latest", "url": "https://github.com/sysstat/sysstat" }, @@ -107,7 +94,6 @@ "name": "network-chatterbug", "description": "A suite of communication proxies for HPC applications", "family": "network", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-chatterbug:latest", "url": "https://github.com/hpcgroup/chatterbug" }, @@ -115,7 +101,6 @@ "name": "network-netmark", "description": "point to point networking tool", "family": "network", - "type": "standalone", "image": "vanessa/netmark:latest", "url": "" }, @@ 
-123,23 +108,13 @@ "name": "network-osu-benchmark", "description": "point to point MPI benchmarks", "family": "network", - "type": "standalone", "image": "ghcr.io/converged-computing/metric-osu-benchmark:latest", "url": "https://mvapich.cse.ohio-state.edu/benchmarks/" }, - { - "name": "perf-hpctoolkit", - "description": "performance tools for measurement and analysis", - "family": "performance", - "type": "application", - "image": "ghcr.io/converged-computing/metric-hpctoolkit-view:latest", - "url": "https://gitlab.com/hpctoolkit/hpctoolkit" - }, { "name": "perf-sysstat", "description": "statistics for Linux tasks (processes) : I/O, CPU, memory, etc.", "family": "performance", - "type": "application", "image": "ghcr.io/converged-computing/metric-sysstat:latest", "url": "https://github.com/sysstat/sysstat" } diff --git a/docs/_static/data/table.html b/docs/_static/data/table.html index ec9a672..ca9a248 100644 --- a/docs/_static/data/table.html +++ b/docs/_static/data/table.html @@ -415,7 +415,6 @@ Name - Type Family Description Container @@ -440,7 +439,6 @@ } return "" + data +"";}, }, - { data: "type"}, { data: "family"}, { data: "description"}, { data: "image", @@ -457,22 +455,22 @@ 'rowCallback': function(row, data, index){ // Distinguish family if(data.family == 'storage'){ - $(row).find('td:eq(2)').css('background-color', 'lavender'); + $(row).find('td:eq(1)').css('background-color', 'lavender'); } if(data.family == 'solver'){ - $(row).find('td:eq(2)').css('background-color', 'lightgreen'); + $(row).find('td:eq(1)').css('background-color', 'lightgreen'); } if(data.family == 'performance'){ - $(row).find('td:eq(2)').css('background-color', '#f79fb7'); + $(row).find('td:eq(1)').css('background-color', '#f79fb7'); } if(data.family == 'machine-learning'){ - $(row).find('td:eq(2)').css('background-color', 'palegoldenrod'); + $(row).find('td:eq(1)').css('background-color', 'palegoldenrod'); } if(data.family == 'network'){ - 
$(row).find('td:eq(2)').css('background-color', 'orange'); + $(row).find('td:eq(1)').css('background-color', 'orange'); } if(data.family == 'simulation'){ - $(row).find('td:eq(2)').css('background-color', 'skyblue'); + $(row).find('td:eq(1)').css('background-color', 'skyblue'); } } }); diff --git a/docs/development/designs/current.md b/docs/development/designs/current.md new file mode 100644 index 0000000..60aefcc --- /dev/null +++ b/docs/development/designs/current.md @@ -0,0 +1,58 @@ +# Current Design + +For this second design, we can more easily say: + +> A Metric Set is a collection of metrics to measure IO, performance, or networking that can be customized with addons. + +The original design was a good first shot, but was flawed in several ways: + +1. I could not combine metrics into one. E.g., if I wanted to use a launcher jobset design combined with HPCToolkit, another metric, I could not. +2. The top level set types - standalone, application, and storage - didn't have much meaning. +3. The use of Storage, Application, and Volume was messy at best (external entities to add to a metric set). + +For this second design, the "MetricSet" is still mirroring the design of a JobSet, but it is more generic, and of one type. There are no longer different +flavors of metric sets. Rather, we allow metrics to generate replicated jobs. For the "extras" that we need to integrate to supplement those jobs - e.g., applications, volumes/storage, or +even extra containers that add logic - these are now called metric addons. More specifically, an addon can: + + - Add extra containers (and config maps for their entrypoints) + - Add custom logic to entrypoints for specific jobs and/or containers + - Add additional volumes that run the gamut from empty to persistent disk. + +The current design allows only one JobSet per metrics.yaml, and this was an explicit choice after realizing that it's unlikely to want more than one.
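To make the addon idea concrete, here is a minimal Go sketch of the fields an addon entry carries in the v1alpha2 CRD schema (`name`, `options`, `listOptions`, `mapOptions`). The `Addon` struct and its `Validate` method are hypothetical and simplified for illustration: the real API types use int-or-string values, while plain strings are assumed here to keep the sketch self-contained. The option keys shown are guesses, not documented options.

```go
package main

import (
	"errors"
	"fmt"
)

// Addon is a hypothetical, simplified mirror of an addon entry in the
// v1alpha2 MetricSet CRD. The CRD allows int-or-string values; plain
// strings are an assumption made to avoid external dependencies.
type Addon struct {
	Name        string
	Options     map[string]string
	ListOptions map[string][]string
	MapOptions  map[string]map[string]string
}

// Validate enforces the one rule the CRD schema states for an addon:
// the name field is required.
func (a Addon) Validate() error {
	if a.Name == "" {
		return errors.New("addon name is required")
	}
	return nil
}

func main() {
	// "volume-empty" is one of the addon names listed in addons.json.
	// The "name" and "path" option keys are illustrative assumptions.
	addon := Addon{
		Name:    "volume-empty",
		Options: map[string]string{"name": "scratch", "path": "/scratch"},
	}
	if err := addon.Validate(); err != nil {
		fmt.Println("invalid addon:", err)
		return
	}
	fmt.Printf("addon %s with %d options\n", addon.Name, len(addon.Options))
}
```

Keeping the three option maps separate mirrors the schema, where scalar, list, and nested-map options are distinct properties rather than one free-form blob.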
+ +## Kubernetes Abstractions + +We use a JobSet on the top level with Replica set to 1, and within that set, each metric is allowed to create one or more ReplicatedJob. We can easily customize the style of the replicated job based +on interfaces. For example: + +- The `LauncherWorker` is a typical design that might have a launcher and MPI hostlist written, and a main command run there to then interact with the workers. +- The `SingleApplication` is a basic design that expects one or more pods in an indexed job, and also shares the process namespace. +- The `StorageGeneric` is almost the same, but doesn't share a process namespace. + +I haven't found a need for another kind of design yet (most are the launcher worker type) but can easily add them if needed. +There is no longer any distinction between MetricSet types, as there is only one MetricSet that serves as a shell for its metrics. + +## Output Options + +### Logging Parser + +For the simplest start, I've decided to allow for metrics to have their own custom output (indeed it would be hard to standardize this between so many different tools) but have the operator +provide structure to that, meaning separators to distinguish sections, and a consistent way to output metadata. As an example, here is what the top level metadata and sections (with some custom output data between) +would look like: + +```console +METADATA START {"pods":1,"completions":1,"storageVolumePath":"/workflow","storageVolumeHostPath":"/tmp/workflow","metricName":"io-sysstat","metricDescription":"statistics for Linux tasks (processes) : I/O, CPU, memory, etc.","metricType":"storage","metricOptions":{"completions":2,"human":"false","rate":10}} +METADATA END +METRICS OPERATOR COLLECTION START +METRICS OPERATOR TIMEPOINT +...custom data output here for timepoint 1... +METRICS OPERATOR TIMEPOINT +...custom data output here for timepoint 2... +METRICS OPERATOR TIMEPOINT +...custom data output here for timepoint N...
+METRICS OPERATOR COLLECTION END +``` + +In the above, we can parse the metadata for the run from the first line (a subset of flattened, important features dumped in json) and then clearly mark the start and end of collection, +along with separation between timepoints. This is the most structure we can provide, as each metric output looks different. It's up to the Python module parser from the "metricsoperator" +module to know how to parse (and possibly plot) any specific output type. \ No newline at end of file diff --git a/docs/development/designs.md b/docs/development/designs/design1.md similarity index 100% rename from docs/development/designs.md rename to docs/development/designs/design1.md diff --git a/docs/development/img/application-metric-set.png b/docs/development/designs/img/application-metric-set.png similarity index 100% rename from docs/development/img/application-metric-set.png rename to docs/development/designs/img/application-metric-set.png diff --git a/docs/development/img/application-metric-volume.png b/docs/development/designs/img/application-metric-volume.png similarity index 100% rename from docs/development/img/application-metric-volume.png rename to docs/development/designs/img/application-metric-volume.png diff --git a/docs/development/img/standalone-metric-set.png b/docs/development/designs/img/standalone-metric-set.png similarity index 100% rename from docs/development/img/standalone-metric-set.png rename to docs/development/designs/img/standalone-metric-set.png diff --git a/docs/development/img/storage-metric-set.png b/docs/development/designs/img/storage-metric-set.png similarity index 100% rename from docs/development/img/storage-metric-set.png rename to docs/development/designs/img/storage-metric-set.png diff --git a/docs/development/designs/index.md b/docs/development/designs/index.md new file mode 100644 index 0000000..57b9b72 --- /dev/null +++ b/docs/development/designs/index.md @@ -0,0 +1,10 @@ +# Designs + +The Metrics Operator 
has had several designs (and re-designs, which is typical of the developer). +This small set of files details the history. + +```{toctree} +:maxdepth: 3 +current +design1 +``` diff --git a/docs/development/developer-guide.md b/docs/development/developer-guide.md index 7038656..45ac890 100644 --- a/docs/development/developer-guide.md +++ b/docs/development/developer-guide.md @@ -121,32 +121,39 @@ This section will include instructions for how to write a metrics container. ### General Instructions -We provide templates for different kinds of JobSet (e.g., SingleApplication vs. LauncherWorker pattern) in pkg/jobs, -so the easiest thing to do is to find the template that is closest in design to what you want, and then -copy a metric go file from `pkg/metrics/*` that is closest. You will need to: +Metrics largely have functionality that comes from shared interfaces, such as a `LauncherWorker` +design that has a main launcher node and some number of worker nodes, and basic interfaces +for storage and applications. The best thing to do is explore the current metrics, find one that +is similar to what you want to do, and use it as a template. Put it in a known group +directory, e.g., one of these: - - Change the interface struct name +```bash +pkg/metrics/ +├── app +├── io +├── network +└── perf +``` + +It will then be discovered, registered, and available for use. + +You will generally need to: + + - Change the interface struct name depending on what you need - Update parameters /options for your needs - Change the URL, and the metadata at the bottom (container, description, identifier) -The main logic for a metric will be in the function to `GenerateEntrypoints`. For development, +The main logic for a metric will be in the `PrepareContainers` function. For development,
For development, I find it easiest to build the container first (as an automated build), have a general sense how to -run the metric, and then insert `sleep infinity` into the launcher (or primary) script in that function, -and interactively develop. When you do this, you'll also want to: +run the metric, create a `metrics.yaml` for it, and then insert `sleep infinity` +(or set logging->interactive to true) to interactively develop. When you do this, you'll also want to: - Add a named example in `examples/tests` - Run `make pre-push` before pushing to update docs metadata - Run `make html` and cd into `_build/html` and `python -m http.server 9999` (and open to that port) to preview - The metrics.html page under getting started shows the metadata that is rendered from the code. You may need to make the frame taller. -### Performance via PID - -For a perf metric, you can assume that your metric container will be run as a sidecar, -and have access to the PID namespace of the application container. - -- They should contain wget if they need to download the wait script in the entrypoint - -WIP +For addons, the same logic applies, but you will want to add content to `pkg/addons` instead. 
## Documentation diff --git a/docs/development/index.md b/docs/development/index.md index 29946a2..149985b 100644 --- a/docs/development/index.md +++ b/docs/development/index.md @@ -7,7 +7,7 @@ any questions, please [let us know](https://github.com/converged-computing/metri ```{toctree} :maxdepth: 3 developer-guide -designs +designs/index.md metrics debugging creation diff --git a/docs/development/metrics.md b/docs/development/metrics.md index 31795ea..cb90af2 100644 --- a/docs/development/metrics.md +++ b/docs/development/metrics.md @@ -6,7 +6,6 @@ These are metrics that are consistered under development (and likely need more e ### network-chatterbug - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[network-chatterbug](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/network-chatterbug)* Chatterbug provides a [suite of communication proxy applications](https://github.com/hpcgroup/chatterbug) for HPC. @@ -50,7 +49,6 @@ See the example linked in the header for a metrics.yaml example. ### app-hpl - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-hpl](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-hpl)* The [Linpack](https://ulhpc-tutorials.readthedocs.io/en/production/parallel/mpi/HPL/) benchmark is used for the [Top500](https://www.top500.org/project/linpack/), @@ -88,7 +86,7 @@ script help below:
-`compute_N --help` +compute_N --help ```console # compute_N -h diff --git a/docs/getting_started/addons.md b/docs/getting_started/addons.md new file mode 100644 index 0000000..c56eb75 --- /dev/null +++ b/docs/getting_started/addons.md @@ -0,0 +1,278 @@ +# Addons + +An addon is a generic way to customize a metric. An addon can do any of the following: + +- generating new application or sidecar containers +- adding volumes, including writing new config maps +- customizing entrypoints, to the granularity of a jobset or a container + +As an example, if you wanted to use an IO benchmark, you would test it against different storage +solutions by adding a volume addon. The different groups available are discussed below, and if you +have a request for an addon please [let us know](https://github.com/converged-computing/metrics-operator/issues). + + + +## Existing Volumes + +An existing volume addon can be provided to a metric. As an example, it would make sense to run an IO benchmark with +different kinds of volume addons. The addons for volumes currently include: + + - a persistent volume claim (PVC) and persistent volume (PV) that you've created + - a secret that you've created + - a config map that you've created + - a host volume (typically for testing) + - an empty volume + +For all of the above, you create the volume and provide its metadata to the operator via the addon, which will ensure the volume is available for your metric. We will provide examples here to do that. + +#### persistent volume claim addon + +As an example, here is how to provide the name of an existing claim (you created separately) to a metric container: +TODO add support to specify a specific metric container or replicated job container, if applicable.
+ +```yaml +spec: + metrics: + - name: app-lammps + addons: + # This name is a unique identifier for this addon + - name: volume-pvc + options: + name: data + claimName: data + path: /workflow +``` + +The above would add a claim named "data" to the metric container(s). + +#### config map addon example + +Here is an example of providing a config map to an application container. In layman's terms, we are deploying vanilla nginx, but adding a configuration file +to `/etc/nginx/conf.d`. + +```yaml +spec: + metrics: + - name: app-lammps + addons: + # This name is a unique identifier for this addon + - name: volume-cm + options: + name: nginx-conf + configMapName: nginx-conf + path: /etc/nginx/conf.d + mapOptions: + items: + flux.conf: flux.conf +``` + +You would have created this config map first, before the MetricSet. Here is an example: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: nginx-conf + namespace: metrics-operator +data: + flux.conf: | + server { + listen 80; + server_name localhost; + location / { + root /usr/share/nginx/html; + index index.html index.htm; + } + } +``` + +#### secret addon example + +Here is an example of providing an existing secret (in the metrics-operator namespace) +to the metric container(s): + +```yaml +spec: + metrics: + - name: app-lammps + addons: + # This name is a unique identifier for this addon + - name: volume-secret + options: + name: certs + path: /etc/certs + secretName: certs +``` + +The above shows an existing secret named "certs" that we will mount into `/etc/certs`.
+ +#### hostpath volume addon example + +Here is how to use a host path: + +```yaml +spec: + metrics: + - name: app-lammps + addons: + # This name is a unique identifier for this addon + - name: volume-hostpath + options: + name: data + hostPath: /path/on/host + path: /path/in/container +``` + + +TODO convert to addon logic + +### application + +When you want to measure application performance, you'll need to add an "application" section to your MetricSet. This is the container that houses some application that you want to measure performance for. This means that minimally, you are required to define the application container image and command: + + +```yaml +spec: + application: + image: ghcr.io/rse-ops/vanilla-lammps:tag-latest + command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite +``` + +In the above example, we target a container with LAMMPS and mpi, and we are going to run MPIrun. +The command will be used by the metrics sidecar containers to find the PID of interest to measure. + +#### workingDir + +To add a working directory for your application: + +```yaml +spec: + application: + image: ghcr.io/rse-ops/vanilla-lammps:tag-latest + command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite + workingDir: /opt/lammps/examples/reaxff/HNS +``` + +#### volumes + +An application is allowed to have one or more existing volumes. An existing volume can be any of the types described in [existing volumes](#existing-volumes) + +#### resources + +You can define resources for an application or a metric container. Known keys include "memory" and "cpu" (should be provided in some string format that can be parsed) and all others are considered some kind of quantity request. + +```yaml +application: + resources: + memory: 500M + cpu: 4 +``` + +Metrics can also take resource requests. 
+ +```yaml +metrics: + - name: io-fio + resources: + memory: 500M + cpu: 4 +``` + +If you wanted to, for example, request a GPU, that might look like: + +```yaml +resources: + limits: + gpu-vendor.example/example-gpu: 1 +``` + +Or for a particular type of networking fabric: + +```yaml +resources: + limits: + vpc.amazonaws.com/efa: 1 +``` + +Both limits and resources are flexible to accept a string or an integer value, and you'll get an error if you +provide something else. If you need something else, [let us know](https://github.com/converged-computing/metrics-operator/issues). +If you are requesting GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful. + +### storage + +When you want to measure some storage performance, you'll want to add a "storage" section to your MetricSet. This will typically just be a reference to some existing storage (see [existing volumes](#existing-volumes)) that we want to measure, and can also be done for some number of completions and metrics for storage. + +#### commands + +If you need to add some special logic to create or cleanup for a storage volume, you are free to define them for storage in each of pre and post sections, which will happen before and after the metric runs, respectively. + +```yaml +storage: + volume: + claimName: data + path: /data + commands: + pre: | + apt-get update && apt-get install -y mymounter-tool + mymounter-tool mount /data + post: mymounter-tool unmount /data + # Wrap the storage metric in this prefix + prefix: myprefix +``` + +All of the above are strings. The pipe allows for multiple lines, if appropriate. +Note that while a "volume" is typical, you might have a storage setup that is done via a set of custom commands, in which case +you don't need to define the volume too. 
+ +## Performance + +### perf-hpctoolkit + + - *[perf-hpctoolkit](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/perf-lammps-hpctoolkit)* + +This metric provides [HPCToolkit](https://gitlab.com/hpctoolkit/hpctoolkit) for your application to use. This is the first metric of its type +to use a shared volume approach. Specifically, we: + +- add a new ability for an application metric to define an empty volume, and have the metrics container copy stuff to it +- also add an ability for this kind of application metric to customize the application entrypoint (e.g., copy volume contents to destinations) +- build a spack copy view into the [hpctoolkit metrics container](https://github.com/converged-computing/metrics-containers/blob/main/hpctoolkit-containerize/Dockerfile) +- move the `/opt/software` and `/opt/views/view` roots into the application container; this is a modular install of HPCToolkit. +- copy over `/opt/share/software` (provided via the shared empty volume) to `/opt/software` where spack expects it. We also add `/opt/share/view/bin` to the path (where hpcrun is) + +After those steps are done, HPCToolkit is essentially installed, on the fly, in the application container. Since the `hpcrun` command is using `LD_AUDIT` we need +all libraries to be in the same system (the shared process namespace would not work). We can then run it, and generate a database. Also note that by default, +we run the post-analysis steps (shown below) and also provide them in each container as `post-run.sh`, which the addon will run for you, unless you +set `postAnalysis` to "false." Finally, if you need to run it manually, here is an example +given `hpctoolkit-lmp-measurements` in the present working directory of the container. + + +```bash +hpcstruct hpctoolkit-lmp-measurements + +# Run "the professor!"
🤓️ +hpcprof hpctoolkit-lmp-measurements +``` + +The above generates a database, `hpctoolkit-lmp-database`, that you can copy to your machine for further interaction with hpcviewer +(or some future tool that doesn't use Java)! + +```bash +kubectl cp -c app metricset-sample-m-0-npbc9:/opt/lammps/examples/reaxff/HNS/hpctoolkit-lmp-database hpctoolkit-lmp-database +hpcviewer ./hpctoolkit-lmp-database +``` + +Here are the acceptable parameters. + +| Name | Description | Type | Default | +|-----|-------------|------------|------| +| mount | Path to mount hpctoolkit view in application container | string | /opt/share | +| events | Events for hpctoolkit | string | `-e IO` | +| image | Customize the container image | string | `ghcr.io/converged-computing/metric-hpctoolkit-view:ubuntu` | +| output | The output directory for hpcrun (database will generate to *-database) | string | hpctoolkit-result | + +Note that for image we also provide a rocky build base, `ghcr.io/converged-computing/metric-hpctoolkit-view:rocky`. +You can also see events available with `hpcrun -L`, and use the container for this metric. +There is a brief listing on [this page](https://hpc.llnl.gov/software/development-environment-software/hpc-toolkit). +We recommend that you do not pair hpctoolkit with another metric, primarily because it is customizing the application +entrypoint. If you add a process-namespace based metric, you likely need to account for the hpcrun command being the +wrapper to the actual executable. diff --git a/docs/getting_started/custom-resource-definition.md b/docs/getting_started/custom-resource-definition.md index 5429951..1d17ad9 100644 --- a/docs/getting_started/custom-resource-definition.md +++ b/docs/getting_started/custom-resource-definition.md @@ -18,7 +18,7 @@ The yaml spec will normally have an API version, the kind `MetricSet` and then a name and (optionally, a namespace) to identify the custom resource definition followed by the spec for it.
Here is a spec that will deploy to the `default` namespace: ```yaml -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -31,23 +31,13 @@ spec: ### Spec -Under the spec, there are several variables to define. Descriptions are included below, and we recommend that you look at [examples](https://github.com/converged-computing/metrics-operator/tree/main/examples) in the repository for more. +Under the spec, there are several variables to define. Descriptions are included below, and we recommend that you look at [examples](https://github.com/converged-computing/metrics-operator/tree/main/examples) in the repository for more. Note that the general design takes one or more metrics, and each metric can have additional addons for storage volumes, additional containers, or other addon types. Specifically, you must choose ONE of: - - application - - storage - -Where an application will be run for some number of pods (completions) and measured by metrics pods (separate pods) OR a storage metric will run directly, and with some -number of pods (completions) to bind to the storage and measure. - ### pods The number of pods for an application or storage metric test will correspond with the parallelism of the indexed job (which comes down to pods) for the storage or application JobSet. This defaults to 1, meaning we run in a non-indexed mode. The indexed mode is determined automatically by this variable, where "1" indicates non-indexed, and >1 is indexed. -### completions - -When running as an indexed job, indicate the number of successful pods (completions) for the Job to be successful. Note that if you set 1, your parallelism will default to 1 too, which isn't ideal. I've opened an issue [here](https://github.com/kubernetes-sigs/jobset). 
- ### logging We are anticipating adding more logging options, but for not logging exposes one "interactive" option that will add a "sleep infinity" to the end of a storage, performance, or standalone metric. @@ -73,104 +63,6 @@ spec: By default it is false, meaning we use fully qualified domain names. -### application - -When you want to measure application performance, you'll need to add an "application" section to your MetricSet. This is the container that houses some application that you want to measure performance for. This means that minimally, you are required to define the application container image and command: - - -```yaml -spec: - application: - image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite -``` - -In the above example, we target a container with LAMMPS and mpi, and we are going to run MPIrun. -The command will be used by the metrics sidecar containers to find the PID of interest to measure. - -#### workingDir - -To add a working directory for your application: - -```yaml -spec: - application: - image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite - workingDir: /opt/lammps/examples/reaxff/HNS -``` - -#### volumes - -An application is allowed to have one or more existing volumes. An existing volume can be any of the types described in [existing volumes](#existing-volumes) - -#### resources - -You can define resources for an application or a metric container. Known keys include "memory" and "cpu" (should be provided in some string format that can be parsed) and all others are considered some kind of quantity request. - -```yaml -application: - resources: - memory: 500M - cpu: 4 -``` - -Metrics can also take resource requests. 
- -```yaml -metrics: - - name: io-fio - resources: - memory: 500M - cpu: 4 -``` - -If you wanted to, for example, request a GPU, that might look like: - -```yaml -resources: - limits: - gpu-vendor.example/example-gpu: 1 -``` - -Or for a particular type of networking fabric: - -```yaml -resources: - limits: - vpc.amazonaws.com/efa: 1 -``` - -Both limits and resources are flexible to accept a string or an integer value, and you'll get an error if you -provide something else. If you need something else, [let us know](https://github.com/converged-computing/metrics-operator/issues). -If you are requesting GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful. - -### storage - -When you want to measure some storage performance, you'll want to add a "storage" section to your MetricSet. This will typically just be a reference to some existing storage (see [existing volumes](#existing-volumes)) that we want to measure, and can also be done for some number of completions and metrics for storage. - -#### commands - -If you need to add some special logic to create or cleanup for a storage volume, you are free to define them for storage in each of pre and post sections, which will happen before and after the metric runs, respectively. - -```yaml -storage: - volume: - claimName: data - path: /data - commands: - pre: | - apt-get update && apt-get install -y mymounter-tool - mymounter-tool mount /data - post: mymounter-tool unmount /data - # Wrap the storage metric in this prefix - prefix: myprefix -``` - -All of the above are strings. The pipe allows for multiple lines, if appropriate. -Note that while a "volume" is typical, you might have a storage setup that is done via a set of custom commands, in which case -you don't need to define the volume too. - ### metrics The core of the MetricSet of course is the metrics! Since we can measure more than one thing at once, this is a list of named metrics known to the operator. 
As an example, here is how to run the `perf-sysstat` metric: @@ -181,37 +73,20 @@ spec: - name: perf-sysstat ``` -To see all the metrics available, see [metrics](metrics.md). We will be adding many more as the operator is developed. - -#### rate - -A metric will be collected at some rate (in seconds) and this defaults to 10. -To change the rate for a metric: +For any metric, advanced users might want to set a custom container. This is done at your own expertise (and risk): ```yaml spec: metrics: - name: perf-sysstat - rate: 20 + container: ghcr.io/my-github/my-sysstat-container:latest ``` -#### completions - -Completions for a metric are relevant if you are assessing storage (which doesn't have an application runtime) or a service application that will continue to run forever. When this value is set to 0, it essentially indicates no set number of completions (meaning we run forever). Any non-zero value will ensure the metric -runs for that many completions before exiting. - -```yaml -spec: - metrics: - - name: io-sysstat - completions: 5 -``` - -This is usually suggested to provide for a storage metric. +To see all the metrics available, see [metrics](metrics.md). We will be adding many more as the operator is developed. #### options -Metrics can take custom options, which are key value pairs of a string key and either string or integer value. These come in three types: +Generally, the specific parameters for any given metric are defined via the options, including: - options (key value pairs, where the value is an integer/string type) - listOptions (key value pairs, where the value is a list of integer/string types) @@ -236,136 +111,27 @@ spec: ``` Presence of absence of an option type depends on the metric. Metrics are free to use these custom -options as they see fit. - +options as they see fit, and validate in the same manner. 
-## Existing Volumes +#### addons -An existing volume can be provided to support an application (multiple) or one can be provided for assessing its performance (single). +An addon is a flexible interface to define everything from volumes to containers to be deployed alongside the metric. +If you are curious, a metric will generate one or more replicated Jobs in a Jobset, and the addon is free to customize these. +Akin to [metric options](#options) addons support the same types: - - a persistent volume claim (PVC) and persistent volume (PV) that you've created - - a secret that you've created - - a config map that you've created - - a host volume (typically for testing) + - options + - listOptions + - mapOptions -and for all of the above, you want to provide it to the operator, which will ensure the volume is available for your application or storage. For an application, you'd define your volumes as such: +As an example, here is a metric with a few named addons - an empty volume, and adding hpctoolkit to run alongside lammps. ```yaml -spec: - application: - image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: nginx -g daemon off; - volumes: - data: - path: /workflow - claimName: data -``` - -The use case above, for an application, is that it requires some kind of data or storage alongside it to function. The volumes spec above is a key value (e.g., "data" is the key) to ensure that names are unique. For storage, you'll only be defining one volume: - -```yaml -spec: - storage: - volume: - path: /workflow - claimName: data -``` - -And the implicit name would be "storage" (although it's probably not important for you to know that). For the remaining examples, we will provide examples for application volumes, however know that the examples are also valid for the second -storage format. 
- -#### persistent volume claim example - -As an example, here is how to provide the name of an existing claim (you created separately) to a container: - -```yaml -spec: - application: - image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: nginx -g daemon off; - - # This is an existing PVC (and associated PV) we created before the MetricSet - volumes: - data: - path: /workflow - claimName: data -``` - -The above would add a claim named "data" to the application container(s). - -#### config map example - -Here is an example of providing a config map to an application container In layman's terms, we are deploying vanilla nginx, but adding a configuration file -to `/etc/nginx/conf.d` - -```yaml -spec: - application: - image: nginx - command: nginx -g daemon off; - - # This is an existing PVC (and associated PV) we created before the MetricSet - volumes: - nginx-conf: - configMapName: nginx-conf - path: /etc/nginx/conf.d - items: - flux.conf: flux.conf -``` - - -You would have created this config map first, before the MetricSet. Here is an example: - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: nginx-conf - namespace: metrics-operator -data: - flux.conf: | - server { - listen 80; - server_name localhost; - location / { - root /usr/share/nginx/html; - index index.html index.htm; - } - } -``` - -#### secret example - -Here is an example of providing an existing secret (in the metrics-operator namespace) -to the application container(s): - -```yaml -spec: - application: - image: nginx - command: nginx -g daemon off; - - volumes: - certs: - path: /etc/certs - secretName: certs +metrics: + - name: app-lammps + addons: + - name: volume-empty + - name: perf-hpctoolkit ``` -The above shows an existing secret named "certs" that we will mount into `/etc/certs`. 
- -#### hostpath volume example - -Here is how to use a host path: - -```yaml -spec: - application: - image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: nginx -g daemon off; - - # This is an existing PVC (and associated PV) we created before the MetricSet - volumes: - data: - hostPath: true - path: /workflow -``` +Each addon has its own custom options. You can look at examples and at our [addons documentation](addons.md) for more detail on how to add existing volumes +or other custom functionality. diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md index 3a72a7c..70f8a94 100644 --- a/docs/getting_started/index.md +++ b/docs/getting_started/index.md @@ -8,5 +8,6 @@ This documentation includes a developer guide, and user guide. If you have any q :maxdepth: 2 user-guide metrics +addons custom-resource-definition ``` diff --git a/docs/getting_started/metrics.md b/docs/getting_started/metrics.md index 4a0065f..77f02bf 100644 --- a/docs/getting_started/metrics.md +++ b/docs/getting_started/metrics.md @@ -3,76 +3,17 @@ The following metrics are under development (or being planned). - [Examples](https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#examples) - - [Storage Metrics](https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#storage) - - [Application Metrics](https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#application) - - [Standalone Metrics](https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#standalone) -Each of the above is a metric design, which is primarily represented in the Metrics Operator code. However, within each design -there are different families of metrics (e.g., storage, network, performance, simulation) shown in the table below as the "Family" column. +Each metric can be ascribed to a high level family, shown in the table below as the "Family" column. 
We likely will tweak and improve upon these categories. - + ## Implemented Metrics -Each metric has a link to the type, along with (optionally) examples. These sections will better be organized by -family once we decide on a more final set. +### perf-sysstat -### Performance - -These metrics are intended to assess application performance, where they run alongside an application of interest. - -#### perf-hpctoolkit - - - [Application Metric Set](user-guide.md#application-metric-set) - - *[perf-hpctoolkit](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/perf-hpctoolkit)* - -This metric provides [HPCToolkit](https://gitlab.com/hpctoolkit/hpctoolkit) for your application to use. This is the first metric of its type -to use a shared volume approach. Specifically, we: - -- add a new ability for an application metric to define an empty volume, and have the metrics container copy stuff to it -- also add an ability for this kind of application metric to customize the application entrypoint (e.g., copy volume contents to destinations) -- build a spack copy view into the [hpctoolkit metrics container](https://github.com/converged-computing/metrics-containers/blob/main/hpctoolkit-containerize/Dockerfile) -- move the `/opt/software` and `/opt/views/view` roots into the application container, this is a modular install of HPCToolkit. -- copy over `/opt/share/software` (provided via the shared empty volume) to `/opt/software`` where spack expects it. We also add `/opt/share/view/bin` to the path (where hpcrun is) - -After those steps are done, HPCToolkit is essentially installed, on the fly, in the application container. Since the `hpcrun` command is using `LD_AUDIT` we need -all libraries to be in the same system (the shared process namespace would not work). We can then run it, and generate a database. Here is an example -given `hpctoolkit-lmp-measurements` in the present working directory of the container. 
- - -```bash -hpcstruct hpctoolkit-lmp-measurements - -# Run "the professor!" 🤓️ -hpcprof hpctoolkit-lmp-measurements -``` - -The above generates a database, `hpctoolkit-lmp-database` that you can copy to your machine for further interaction with hpcviewer -(or some future tool that doesn't use Java)! - -```bash -kubectl cp -c app metricset-sample-m-0-npbc9:/opt/lammps/examples/reaxff/HNS/hpctoolkit-lmp-database hpctoolkit-lmp-database -hpcviewer ./hpctoolkit-lmp-database -``` - -Here are the acceptable parameters. - -| Name | Description | Type | Default | -|-----|-------------|------------|------| -| mount | Path to mount hpctoolview view in application container | string | /opt/share | -| events | Events for hpctoolkit | string | `-e IO` | - -Note that you can see events available with `hpcrun -L`, and use the container for this metric. -There is a brief listing on [this page](https://hpc.llnl.gov/software/development-environment-software/hpc-toolkit). -We recommend that you do not pair hpctoolkit with another metric, primarily because it is customizing the application -entrypoint. If you add a process-namespace based metric, you likely need to account for the hpcrun command being the -wrapper to the actual executable. - -#### perf-sysstat - - - [Application Metric Set](user-guide.md#application-metric-set) - *[perf-hello-world](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/perf-hello-world)* - *[perf-lammps](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/perf-lammps)* @@ -114,13 +55,9 @@ after the index at 0 gets a custom command. See [pidstat](https://man7.org/linux more information on this command, and [this file](https://github.com/converged-computing/metrics-operator/blob/main/pkg/metrics/perf/sysstat.go) for how we use them. If there is an option or command that is not exposed that you would like, please [open an issue](https://github.com/converged-computing/metrics-operator/issues). 
-### Storage -These metrics are intended to assess storage volumes. +### io-fio -#### io-fio - - - [Storage Metric Set](user-guide.md#application-metric-set) - *[io-host-volume](https://github.com/converged-computing/metrics-operator/tree/main/examples/storage/google/io-fusion)* This is a nice tool that you can simply point at a path, and it measures IO stats by way of writing a file there! @@ -128,18 +65,20 @@ Options you can set include: |Name | Description | Type | Default | |-----|-------------|------------|------| -|testname | Name for the test | string | test | +| testname | Name for the test | string | test | | blocksize | Size of block to write. It defaults to 4k, but can be set from 256 to 8k. | string | 4k | | iodepth | Number of I/O units to keep in flight against the file. | int | 64 | | size | Total size of file to write | string | 4G | | directory | Directory (usually mounted) to test. | string | /tmp | +| pre | Custom logic / command to run before Fio | string | unset | +| post | Custom logic / command to run after Fio (e.g., cleanup) | string | unset | +| prefix | Prefix to add to running fio commands (like a wrapper) | string | unset | -For the last "directory" we use this location to write a temporary file, which will be cleaned up. +For the "directory" we use this location to write a temporary file, which will be cleaned up. This allows for testing storage mounted from multiple metric pods without worrying about a name conflict. -#### io-ior +### io-ior - - [Storage Metric Set](user-guide.md#application-metric-set) - *[io-host-volume](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/io-ior)* ![img/ior.jpeg](img/ior.jpeg) @@ -157,9 +96,8 @@ basic commands are done. Note that the container does have mpirun if you want to for this across nodes, but this could be added. [Let us know](https://github.com/converged-computing/metrics-operator/issues) if this would be interesting to you. 
-#### io-sysstat +### io-sysstat - - [Storage Metric Set](user-guide.md#application-metric-set) - *[io-host-volume](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/io-host-volume)* This is the "iostat" executable of the sysstat library. @@ -169,17 +107,13 @@ This is the "iostat" executable of the sysstat library. | human | Show tabular, human-readable output inside of json | string "true" or "false" | "false" | | completions | Number of times to run metric | int32 | unset (runs for lifetime of application or indefinitely) | | rate | Seconds to pause between measurements | int32 | 10 | +| pre | One or more commands to run before iostat | string | unset | +| post | One or more commands to run after iostat | string | unset | This is good for mounted storage that can be seen by the operating system, but may not work for something like NFS. -### Standalone - -Standalone metrics can take on many designs, from a launcher/worker design to test networking, to running -a metric across nodes to assess the node performance. 
- -#### network-netmark +### network-netmark - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[network-netmark](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/network-netmark)* (code still private) This is currently a private container/software, but we have support for it when it's ready to be made public (networking) @@ -193,10 +127,10 @@ Variables to customize include: | sendReceiveCycles | Number of send-receive cycles | options-sendReceiveCycles | int32 | 20 | | messageSize | Message size in bytes | options->messageSize | int32 | 0 | | storeEachTrial | Flag to indicate storing each trial data | options->storeEachTrial | string (true/false) | "true" | +| soleTenancy | Turn off sole tenancy (one pod/node) | options->soleTenancy | string ("false" or "no") | "true" | -#### network-osu-benchmark +### network-osu-benchmark - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[network-osu-benchmark](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/network-osu-benchmark)* Point to point benchmarks for MPI (networking). If listOptions->commands not set, will use all one-point commands. 
@@ -205,10 +139,11 @@ Variables to customize include: |Name | Description | Option Key | Type | Default | |-----|-------------|------------|------|---------| | commands | Custom list of osu-benchmark one-sided commands to run | listOptions->commands | array | unset uses default set | -| sole-tenancy | Turn off sole tenancy (one pod/node) | string ("false" or "no") | "true" | +| soleTenancy | Turn off sole tenancy (one pod/node) | string ("false" or "no") | "true" | | all | Run ALL benchmarks with defaults | string ("true" or "yes") | "false" | | flags | Overwrite defaults flags (experts only!)| string | Defaults to an ideal set per metric (see [osu-benchmark.go](https://github.com/converged-computing/metrics-operator/blob/main/pkg/metrics/network/osu-benchmark.go))| | timed | String "true" or "yes" to add time prefix to mpirun (for debugging, etc) | string | "false" | +| sleep | Number of seconds to sleep to wait for network to be ready | int32 | 60 | By default, we run a subset of commands: @@ -289,9 +224,8 @@ Here are some useful resources for the benchmarks: - [HPC Council](https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1284538459/OSU+Benchmark+Tuning+for+2nd+Gen+AMD+EPYC+using+HDR+InfiniBand+over+HPC-X+MPI) - [AWS Tutorials](https://www.hpcworkshops.com/08-efa/04-complie-run-osu.html) -#### app-lammps +### app-lammps - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-lammps](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-lammps)* Since we were using LAMMPS so often as a benchmark (and testing timing of a network) it made sense to add it here @@ -302,6 +236,7 @@ to assess performance. 
|-----|-------------|------------|------|---------| | command | The full mpirun and lammps command | options->command |string | (see below) | | workdir | The working directory for the command | options->workdir | string | /opt/lammps/examples/reaxff/HNS# | +| soleTenancy | require each pod to have sole tenancy | command->soleTenancy | string | "false" | For inspection, you can see all the examples provided [in the LAMMPS GitHub repository](https://github.com/lammps/lammps/tree/develop/examples). The default command (if you don't change it) intended as an example is: @@ -314,9 +249,8 @@ In the working directory `/opt/lammps/examples/reaxff/HNS#`. You should be calli You should also provide the correct number of processes (np) and problem size for LAMMPS (lmp). We left this as open and flexible anticipating that you as a user would want total control. -#### app-amg +### app-amg - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-amg](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-amg)* AMG means "algebraic multi-grid" and it's easy to confuse with the company [AMD](https://www.amd.com/en/solutions/supercomputing-and-hpc) "Advanced Micro Devices" ! 
From [the guide](https://asc.llnl.gov/sites/asc/files/2020-09/AMG_Summary_v1_7.pdf): @@ -344,7 +278,7 @@ By default, akin to LAMMPS we expose the entire mpirun command along with the wo | Name | Description | Option Key | Type | Default | |-----|-------------|------------|------|---------| | command | The amg command (without mpirun) | options->command |string | (see below) | -| mpirun | The mpirun command (and arguments) | options->mpirun | string | (see below) | +| prefix | The prefix (mpirun command and arguments) | options->mpirun | string | (see below) | | workdir | The working directory for the command | options->workdir | string | /opt/AMG | By default, when not set, you will just run the amg binary to get a test case run: @@ -369,9 +303,8 @@ More likely you want an actual problem size on a specific number of node and tas run a larger problem and the parser does not work as expected, please [send us the output](https://github.com/converged-computing/metrics-operator/issues) and we will provide an updated parser. See [this guide](https://asc.llnl.gov/sites/asc/files/2020-09/AMG_Summary_v1_7.pdf) for more detail. -#### app-quicksilver +### app-quicksilver - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-quicksilver](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-quicksilver)* Quicksilver is a proxy app for Monte Carlo simulation code. You can learn more about it on the [GitHub repository](https://github.com/LLNL/Quicksilver/). 
@@ -380,7 +313,7 @@ By default, akin to other apps we expose the entire mpirun command along with th | Name | Description | Option Key | Type | Default | |-----|-------------|------------|------|---------| | command | The qs command (without mpirun) | options->command |string | (see below) | -| mpirun | The mpirun command (and arguments) | options->mpirun | string | (see below) | +| prefix | The prefix (mpirun command and arguments) | options->mpirun | string | (see below) | | workdir | The working directory for the command | options->workdir | string | /opt/AMG | By default, when not set, you will just run the qs (quicksilver) binary on a sample problem, represented by an input text file: @@ -429,9 +362,8 @@ qs /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp You can also look more closely in the [GitHub repository](https://github.com/LLNL/Quicksilver/tree/master/Examples). -#### app-pennant +### app-pennant - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-pennant](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-pennant)* Pennant is an unstructured mesh hydrodynamics for advanced architectures. 
The documentation is sparse, but you @@ -441,7 +373,7 @@ By default, akin to other apps we expose the entire mpirun prefix and command al | Name | Description | Option Key | Type | Default | |-----|-------------|------------|------|---------| | command | The pennant command (without mpirun) | options->command |string | (see below) | -| mpirun | The mpirun command (and arguments) | options->mpirun | string | (see below) | +| prefix | The prefix (mpirun command and arguments) | options->mpirun | string | (see below) | | workdir | The working directory for the command | options->workdir | string | /opt/AMG | By default, when not set, you will just run pennant on a test problem, represented by an input text file: @@ -531,9 +463,8 @@ There are many input files that come in the container, and here are the fullpath And likely you will need to adjust the mpirun parameters, etc. -#### app-kripke +### app-kripke - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-kripke](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-kripke)* [Kripke](https://github.com/LLNL/Kripke) is (from the README): @@ -545,7 +476,7 @@ Akin to AMG, we allow you to modify each of the mpirun and kripke commands via: | Name | Description | Option Key | Type | Default | |-----|-------------|------------|------|---------| | command | The amg command (without mpirun) | options->command |string | (see below) | -| mpirun | The mpirun command (and arguments) | options->mpirun | string | (see below) | +| prefix | The prefix (mpirun command and arguments) | options->mpirun | string | (see below) | | workdir | The working directory for the command | options->workdir | string | /opt/AMG | By default, when not set, you will just run the kripke binary to get a test case run, so mpirun is set to be blank. 
@@ -577,9 +508,8 @@ ex3_colored-indexset_solution ex6_stencil-offset-layout_solution ex9_matrix-tr (meaning on the PATH in `/opt/Kripke/build/bin` in the container). For apps / metrics to be added, please see [this issue](https://github.com/converged-computing/metrics-operator/issues/30). -#### app-ldms +### app-ldms - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-ldms](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-ldms)* @@ -601,9 +531,8 @@ The following is the default command: ldms_ls -h localhost -x sock -p 10444 -l -v ``` -#### app-nekbone +### app-nekbone - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-nekbone](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-nekbone)* Nekbone comes with a set of example that primarily depend on you choosing the correct workikng directory and command to run from. @@ -627,9 +556,8 @@ And the following combinations are supported. Note that example1 did not build, You can see the archived repository [here](https://github.com/Nek5000/Nekbone). If there are interesting metrics in this project it would be worth bringing it back to life I think. -#### app-laghos +### app-laghos - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-laghos](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-laghos)* From the [Laghos README](https://github.com/CEED/Laghos): @@ -644,9 +572,8 @@ the path, so the default references it as `./laghos`. 
| command | The full mpirun and laghos command | options->command |string | (see below) | | workdir | The working directory for the command | options->workdir | string | /workdir/laghos | -#### app-bdas +### app-bdas - - [Standalone Metric Set](user-guide.md#application-metric-set) - *[app-bdas](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-bdas)* BDAS standards for "Big Data Analysis Suite" and you can read more about it [here](https://asc.llnl.gov/sites/asc/files/2020-09/BDAS_Summary_b4bcf27_0.pdf). diff --git a/docs/getting_started/user-guide.md b/docs/getting_started/user-guide.md index ad93a82..71ee9d6 100644 --- a/docs/getting_started/user-guide.md +++ b/docs/getting_started/user-guide.md @@ -7,26 +7,29 @@ with the Metrics Operator installed and are interested to submit your own [custo ### Overview -Our "MetricSet" is mirroring the design of a [JobSet](https://github.com/kubernetes-sigs/jobset/), which can combine multiple different things (i.e., metrics) into a cohesive unit. +Our "MetricSet" is mirroring the design of a [JobSet](https://github.com/kubernetes-sigs/jobset/), which can simply be defined as follows: + +> A Metric Set is a collection of metrics to measure IO, performance, or networking that can be customized with addons. + When you create a MetricSet using this operator, we assume that you are primarily interested in measuring an application performance, collecting storage metrics, or using a custom metric provided by the operator that has less stringent requirements. -
+Each metric provided by the operator (ranging from network to applications to io) has a prebuilt container, and knows how to launch one or more replicated jobs +to measure or assess the performance of something. A MetricSet itself is just a single shell for some metric, which can be further customized with addons. +A MetricAddon "addon" is flexible to be any kind of "extra" that is needed to supplement a metric run - e.g., applications, volumes/storage, or +even extra containers that add logic. High level, this includes: -Logic Flow of Metrics Operator + - Add extra containers (and config maps for their entrypoints) + - Add custom logic to entrypoints for specific jobs and/or containers + - Add additional volumes that range the gamut from empty to persistent disk. -Given the above assumption, the logic flow of the operator works as follows: +And specific examples might include: -1. You write a metrics.yaml file that optionally includes an application OR storage description or neither for a custom metric. Typically, you'd provide an application for performance metrics, and storage for IO/filesystem metrics, and neither for a custom metric. In simpler terms, we have three types of MetricSet - Application Metric Sets, Storage Metric Sets, and Standalone Metric Sets. -2. You also include a list of metrics. Each metric you choose is associated with a type (internal to the operator) that can match to an Application, Storage, or Standalone Metric set. Don't worry, this is checked for you, and you can use our **TBA** metrics registry to find metrics of interest. -3. The operator will create a JobSet for your metrics set. The structure of this JobSet depends on the type (see more below under [metrics](#metrics)). 
Generally: - - Application metrics create a JobSet with each metric as a sidecar container sharing the process namespace to monitor (they can be given volumes if needed) - - Storage metrics deploy the metrics as containers and give them access to the volume - - Standalone metrics can do any custom design needed, and do not require application or storage (but can be provided storage volumes) + - Every kind of volume is provided as a volume addon, this way you can run a storage metric against some kind of mounted storage. + - A container (application) addon makes it easy to add your custom container to run alongside a metric that shares (and monitors) the process namespace + - A monitoring tool provided via a modular install for a container can be provided as an addon, and it works by creating container, and sharing assets via an empty volume shared with some metric container(s) of interest. The sharing and setup of the volume happens via customizing the main metric entrypoint(s) and also adding a custom config map volume (for the addon container entrypoint). -
- -Generally, you'll be defining an application container with one or more metrics to assess performance, or a storage solution with the same, but metrics to assess IO. There are several modes of operation, depending on your choice of metrics. +Within this space, we can easily customize the patterns of metrics by way of shared interfaces. Common patterns for shared interfaces currently include a `LauncherWorker`, `SingleApplication`, and `StorageGeneric` design. ### Install @@ -93,8 +96,8 @@ TEST SUITE: None Let's first review how this works. -1. We provide metrics here to assess performance, storage, networking, and other custom cases (called standalone). -2. You can choose one or more metrics to run alongside your application or storage (volumes) and measure. +1. We provide metrics here to assess performance, storage, networking, or other custom cases (e.g run an HPC application). +2. You can choose to supplement a metric with addons (e.g., add a volume to an IO metric) 3. The metric output is printed in pod logs with a standard packaging (e.g., sections and headers) to distinguish output sections. 4. We provide a Python module [metricsoperator](https://pypi.org/project/metricsoperator/) that can help you run an experiment, applying the metrics.yaml and then retrieving and parsing logs. @@ -119,167 +122,8 @@ For all metric types, the following applies: 1. You can create more than one pod (scale the metric) as you see fit. 2. There is always a headless service provided for metrics within the JobSet to make use of. -3. The definition of metrics in your metrics.yaml file is consistent across types. -4. Each metric type in the list can take a rate, completions, and custom options. - -For another overview of these designs, please see the [developer docs](../development/index.md). - -### Application Metric Set - -> An application metric set includes one or more metrics for measuring application performance. 
We take two strategies: - - - Share the process namespace, giving access of the metric container to the process space of the application - - Share a volume on the filesystem, allowing content from the metrics container to be used in the application container - -Let's walk through an example. In the image below, you want to run one or more custom metrics to measure performance for your application of choice. - -![img/application-metric-set-diagram.png](img/application-metric-set-diagram.png) - -You'll do this by writing a metrics.yaml file (left panel) that defines the application of interest, which in this case in LAMMPS. -This will be handed to the metrics operator (middle panel) that will validate your MetricSet and prepare to deploy, and -the result is a JobSet (right panel) that includes a Job with one or more containers alongside your application. -Let's look at this process in more detail. Here is what the metrics.yaml file might look like. -Note that the image above defines two metrics, but the YAML file below only shows a list of one. - -```yaml -apiVersion: flux-framework.org/v1alpha1 -kind: MetricSet -metadata: - labels: - app.kubernetes.io/name: metricset - app.kubernetes.io/instance: metricset-sample - name: metricset-sample -spec: - application: - image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite - metrics: - - name: perf-sysstat -``` - -It was a design choice that using an application container in this context requires no changes to the container itself. -You simply need to know what the entrypoint command is, and this will allow the metric sidecar containers to monitor it. -In our case, for our command we are issuing `mpirun`, and that's what we want to monitor. Thus, the `image` and `command` attributes are the -only two required for a basic application setup. 
For the next section, "metrics" we've found an application metric (so it can be used for an -Application Metric Set) that we like called `perf-sysstat`, and we add it to the list. We could easily have added more, because one -application run can be monitored by several tools, but we will keep it simple for this example. Next, let's submit this to the metrics operator. - -```bash -$ kubectl apply -f metrics.yaml -``` - -When the operator receives the custom resource, it will do some obvious validation, like did you provide application metrics for an application? -Did you provide all the required fields? Did you only provide a definition for one metric type? Any errors will result in not creating the MetricSet, -and an error in the operator logs. Given that you've provided a valid custom resource YAML and one or more application metrics -that the operator knows, it will then select your metrics from the set it has internally defined (middle panel). This panel shows that -the operator knows about several application (green), storage (yellow), and standalone (red) metrics, and it's going -to combine them into a JobSet that includes your application container to allow each metric to assess performance. - -![img/application-metric-set.png](img/application-metric-set.png) - -Focusing on the third panel, the way this works is that we create a JobSet with a single replicated job with multiple containers. -One container is for your application, and the others are sidecar containers that are running the metrics. Again, because of this design -you don't need to customize or tweak your application container! By way of the shared process namespace and knowing the command you've -executed, the sidecar containers can easily "peek in" to your application container to monitor the process and save metrics. 
-For this application metric set design, all containers should complete to determine success of the JobSet, and we currently -rely on pod logs to get output, however we hope to have a more solid solution for this in the future. - -### Storage Metric - -> A storage metric set includes one or more metrics for assessing one or more volumes - -If you are interested in measuring the goodness of different kinds of volumes, you might be interested in creating a storage metric set! The design is similar -to an application metrics set, however instead of an application of interest, you provide one or more storage volumes of interest. Here is a small -example that assumes a host volume: - -```yaml -apiVersion: flux-framework.org/v1alpha1 -kind: MetricSet -metadata: - labels: - app.kubernetes.io/name: metricset - app.kubernetes.io/instance: metricset-sample - name: metricset-sample -spec: - storage: - volume: - # This is the path on the host (e.g., inside kind container) - hostPath: /tmp/workflow - - # This is the path in the container - path: /workflow - - metrics: - - name: io-sysstat - rate: 10 - completions: 2 -``` - -In the above, we want to use the storage metric called "io-sysstat" to assess a host volume at `/tmp/workflow` that is mounted to `/workflow` in the container. Since a volume -could last forever (hypothetically) we ask for two completions 10 seconds apart each. This means we will get data for two timepoints from the metric, and after that, -the assessment will be complete. We can also look at this visually: - -![img/storage-metric-set-diagram.png](img/storage-metric-set-diagram.png) - -In the above, we are providing storage metrics (the image has two despite the yaml above showing one) that the operator knows about, along with a storage volume that we want to test. -The operator will prepare a JobSet with one replicated job and several containers, where one container is created per storage metric, and the volume bound to each. 
- -![img/storage-metric-set.png](img/storage-metric-set.png) - -In simple terms, a storage metric set will use the volume of interest that you request, and run the tool there. -Read/write is important here - e.g., if the metric needs to write to the volume, a read only volume won't work. Setting up storage -is complex, so it's typically up for you to create the PVC and then the operator will create the volume for it. Keep in mind that you should -honor RWX (read write many) vs just RW (read write) depending on the design you choose. Also note that by default, we only create one pod, -but if appropriate you can scale up to more. - -### Standalone Metric - -> A custom, standalone metric that doesn't abide by any rules! - -The standalone metric is the most interesting of the set, as it doesn't have a strict requirement for a storage or application definition. -We currently have a few flavors of standalone metrics that include: - - - applications that are timed (e.g., LAMMPS) - - networking tools (e.g., OSU benchmarks and netmark) - -By definition, it is "standalone" because it's going to create a custom JobSet setup for a metric of interest. Because we cannot be -certain of how to combine different jobs within this JobSet, we currently only allow one standalone metric to be defined at once. -This means that in the diagram below, you see online one standalone metric in the metrics.yaml - -![img/standalone-metric-set-diagram.png](img/standalone-metric-set-diagram.png) - -As an example, we can look at a standalone metric to run a tool called netmark. 
- -```yaml -apiVersion: flux-framework.org/v1alpha1 -kind: MetricSet -metadata: - labels: - app.kubernetes.io/name: metricset - app.kubernetes.io/instance: metricset-sample - name: metricset-sample -spec: - # Number of indexed jobs to run netmark on - pods: 4 - metrics: - - name: network-netmark - - # Custom options for netmark - # see pkg/metrics/network/netmark.go - options: - tasks: 4 -``` - -This is a standalone metric because it creates a JobSet with not one replicated job, but two! There is a launcher container -to issue an `mpirun` command, and one or more worker containers that interact via MPI. This is a simple example, but any design -for a JobSet could work here, and hence why the metric is standalone. However, it's neat that the interface presented to you -is consistent - it's simply a matter of asking for the metric that is known to the operator to be a standalone. -The image below also demonstrates that this standalone metric (along with storage or application) can be scaled to more -than one pod, if appropriate. - -![img/standalone-metric-set.png](img/standalone-metric-set.png) -For more detail about this design, see the [developer docs](../development/index.md). +For another overview of these designs, please see the [developer docs](../development/designs/index.md). 
## Containers Available diff --git a/docs/make.bat b/docs/make.bat old mode 100644 new mode 100755 diff --git a/examples/dist/metrics-operator-arm.yaml b/examples/dist/metrics-operator-arm.yaml index 895d36f..a1237d4 100644 --- a/examples/dist/metrics-operator-arm.yaml +++ b/examples/dist/metrics-operator-arm.yaml @@ -27,7 +27,7 @@ spec: singular: metricset scope: Namespaced versions: - - name: v1alpha1 + - name: v1alpha2 schema: openAPIV3Schema: description: MetricSet is the Schema for the metrics API @@ -43,95 +43,6 @@ spec: spec: description: MetricSpec defines the desired state of Metric properties: - application: - description: For metrics that require an application, we need a container and name (for now) - properties: - attributes: - description: Container Spec has attributes for the container - properties: - securityContext: - description: Security context for the pod - properties: - privileged: - type: boolean - required: - - privileged - type: object - type: object - command: - description: command to execute and monitor (if consistent across pods) - type: string - entrypoint: - description: Entrypoint of container, if different from command - type: string - image: - type: string - pullSecret: - description: A pull secret for the application container - type: string - resources: - description: Resources include limits and requests for the application - properties: - limits: - additionalProperties: - anyOf: - - type: integer - - type: string - x-kubernetes-int-or-string: true - type: object - requests: - additionalProperties: - anyOf: - - type: integer - - type: string - x-kubernetes-int-or-string: true - type: object - type: object - volumes: - additionalProperties: - description: 'A Volume should correspond with an existing volume, either: config map, secret, or claim name.' 
- properties: - claimName: - description: Claim name if the existing volume is a PVC - type: string - configMapName: - description: Config map name if the existing volume is a config map You should also define items if you are using this - type: string - emptyVol: - default: false - description: EmptyVol if true generates an empty volume at the path - type: boolean - hostPath: - description: Hostpath volume on the host to bind to path - type: string - items: - additionalProperties: - type: string - description: Items (key and paths) for the config map - type: object - path: - description: Path and claim name are always required if a secret isn't defined - type: string - readOnly: - default: false - type: boolean - secretName: - description: An existing secret - type: string - type: object - description: Existing Volumes for the application - type: object - workingDir: - description: Working Directory - type: string - required: - - command - - image - type: object - completions: - description: Single pod completion, meaning the jobspec completions is unset and we only require one main completion - format: int32 - type: integer deadlineSeconds: default: 31500000 description: Should the job be limited to a particular number of seconds? Approximately one year. This cannot be zero or job won't start @@ -150,20 +61,63 @@ spec: metrics: description: The name of the metric (that will be associated with a flavor like storage) items: - description: The difference between benchmark and metric is subtle. A metric is more a measurment, and the benchmark is the comparison value. 
I don't have strong opinions but I think we are doing more measurment not necessarily with benchmarks properties: + addons: + description: A Metric addon can be storage (volume) or an application, It's an additional entity that can customize a replicated job, either adding assets / features or entire containers to the pod + items: + description: 'A Metric addon is an interface that exposes extra volumes for a metric. Examples include: A storage volume to be mounted on one or more of the replicated jobs A single application container.' + properties: + listOptions: + additionalProperties: + items: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + type: array + description: Addon List Options + type: object + mapOptions: + additionalProperties: + additionalProperties: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + type: object + description: Addon Map Options + type: object + name: + type: string + options: + additionalProperties: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + description: Metric Addon Options + type: object + required: + - name + type: object + type: array attributes: description: Container Spec has attributes for the container properties: securityContext: description: Security context for the pod properties: + allowAdmin: + type: boolean + allowPtrace: + type: boolean privileged: type: boolean - required: - - privileged type: object type: object + image: + description: Use a custom container image (advanced users only) + type: string listOptions: additionalProperties: items: @@ -245,54 +199,6 @@ spec: default: ms description: Service name for the JobSet (MetricsSet) cluster network type: string - storage: - description: A storage setup that we want to measure performance for. 
and binding to storage metrics - properties: - commands: - description: Commands to run (pre is supported to make bind) - properties: - post: - description: post happens at end (after collection end) - type: string - pre: - description: pre command happens at start (before anything else) - type: string - prefix: - description: Command prefix to put in front of a metric main command (not applicable for all) - type: string - type: object - volume: - description: Volume type to test (not all storage interfaces require one explicitly) - properties: - claimName: - description: Claim name if the existing volume is a PVC - type: string - configMapName: - description: Config map name if the existing volume is a config map You should also define items if you are using this - type: string - emptyVol: - default: false - description: EmptyVol if true generates an empty volume at the path - type: boolean - hostPath: - description: Hostpath volume on the host to bind to path - type: string - items: - additionalProperties: - type: string - description: Items (key and paths) for the config map - type: object - path: - description: Path and claim name are always required if a secret isn't defined - type: string - readOnly: - default: false - type: boolean - secretName: - description: An existing secret - type: string - type: object - type: object type: object status: description: MetricStatus defines the observed state of Metric diff --git a/examples/dist/metrics-operator.yaml b/examples/dist/metrics-operator.yaml index 3df58f3..6075192 100644 --- a/examples/dist/metrics-operator.yaml +++ b/examples/dist/metrics-operator.yaml @@ -27,7 +27,7 @@ spec: singular: metricset scope: Namespaced versions: - - name: v1alpha1 + - name: v1alpha2 schema: openAPIV3Schema: description: MetricSet is the Schema for the metrics API @@ -43,95 +43,6 @@ spec: spec: description: MetricSpec defines the desired state of Metric properties: - application: - description: For metrics that require an 
application, we need a container and name (for now) - properties: - attributes: - description: Container Spec has attributes for the container - properties: - securityContext: - description: Security context for the pod - properties: - privileged: - type: boolean - required: - - privileged - type: object - type: object - command: - description: command to execute and monitor (if consistent across pods) - type: string - entrypoint: - description: Entrypoint of container, if different from command - type: string - image: - type: string - pullSecret: - description: A pull secret for the application container - type: string - resources: - description: Resources include limits and requests for the application - properties: - limits: - additionalProperties: - anyOf: - - type: integer - - type: string - x-kubernetes-int-or-string: true - type: object - requests: - additionalProperties: - anyOf: - - type: integer - - type: string - x-kubernetes-int-or-string: true - type: object - type: object - volumes: - additionalProperties: - description: 'A Volume should correspond with an existing volume, either: config map, secret, or claim name.' 
- properties: - claimName: - description: Claim name if the existing volume is a PVC - type: string - configMapName: - description: Config map name if the existing volume is a config map You should also define items if you are using this - type: string - emptyVol: - default: false - description: EmptyVol if true generates an empty volume at the path - type: boolean - hostPath: - description: Hostpath volume on the host to bind to path - type: string - items: - additionalProperties: - type: string - description: Items (key and paths) for the config map - type: object - path: - description: Path and claim name are always required if a secret isn't defined - type: string - readOnly: - default: false - type: boolean - secretName: - description: An existing secret - type: string - type: object - description: Existing Volumes for the application - type: object - workingDir: - description: Working Directory - type: string - required: - - command - - image - type: object - completions: - description: Single pod completion, meaning the jobspec completions is unset and we only require one main completion - format: int32 - type: integer deadlineSeconds: default: 31500000 description: Should the job be limited to a particular number of seconds? Approximately one year. This cannot be zero or job won't start @@ -150,20 +61,63 @@ spec: metrics: description: The name of the metric (that will be associated with a flavor like storage) items: - description: The difference between benchmark and metric is subtle. A metric is more a measurment, and the benchmark is the comparison value. 
I don't have strong opinions but I think we are doing more measurment not necessarily with benchmarks properties: + addons: + description: A Metric addon can be storage (volume) or an application, It's an additional entity that can customize a replicated job, either adding assets / features or entire containers to the pod + items: + description: 'A Metric addon is an interface that exposes extra volumes for a metric. Examples include: A storage volume to be mounted on one or more of the replicated jobs A single application container.' + properties: + listOptions: + additionalProperties: + items: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + type: array + description: Addon List Options + type: object + mapOptions: + additionalProperties: + additionalProperties: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + type: object + description: Addon Map Options + type: object + name: + type: string + options: + additionalProperties: + anyOf: + - type: integer + - type: string + x-kubernetes-int-or-string: true + description: Metric Addon Options + type: object + required: + - name + type: object + type: array attributes: description: Container Spec has attributes for the container properties: securityContext: description: Security context for the pod properties: + allowAdmin: + type: boolean + allowPtrace: + type: boolean privileged: type: boolean - required: - - privileged type: object type: object + image: + description: Use a custom container image (advanced users only) + type: string listOptions: additionalProperties: items: @@ -245,54 +199,6 @@ spec: default: ms description: Service name for the JobSet (MetricsSet) cluster network type: string - storage: - description: A storage setup that we want to measure performance for. 
and binding to storage metrics - properties: - commands: - description: Commands to run (pre is supported to make bind) - properties: - post: - description: post happens at end (after collection end) - type: string - pre: - description: pre command happens at start (before anything else) - type: string - prefix: - description: Command prefix to put in front of a metric main command (not applicable for all) - type: string - type: object - volume: - description: Volume type to test (not all storage interfaces require one explicitly) - properties: - claimName: - description: Claim name if the existing volume is a PVC - type: string - configMapName: - description: Config map name if the existing volume is a config map You should also define items if you are using this - type: string - emptyVol: - default: false - description: EmptyVol if true generates an empty volume at the path - type: boolean - hostPath: - description: Hostpath volume on the host to bind to path - type: string - items: - additionalProperties: - type: string - description: Items (key and paths) for the config map - type: object - path: - description: Path and claim name are always required if a secret isn't defined - type: string - readOnly: - default: false - type: boolean - secretName: - description: An existing secret - type: string - type: object - type: object type: object status: description: MetricStatus defines the observed state of Metric diff --git a/examples/python/app-amg/metrics.json b/examples/python/app-amg/metrics.json index b71498f..2dcd845 100644 --- a/examples/python/app-amg/metrics.json +++ b/examples/python/app-amg/metrics.json @@ -1,811 +1,15 @@ [ { "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.154680 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.626910 seconds", - "cpu_mflops": "0.000000" - 
} - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.069727 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.204182 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "8.967483 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "25.461097 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 5.468536e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "22.670724 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "79.430604 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 3.028337e+04", - "figure_of_merit_(fom_1)": "2.407966e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.233553 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.866093 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.042373 seconds", - "wall_mflops": 
"0.000000", - "cpu_clock_time": "0.126982 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "4.685015 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "18.331535 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 1.046720e+04" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "19.920358 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "75.432426 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 3.446454e+04", - "figure_of_merit_(fom_1)": "2.846521e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.240115 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.922137 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.059379 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.263196 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - 
"pcg_setup": { - "wall_clock_time": "5.564823 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "19.814483 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 8.812320e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "23.945171 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "81.345526 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 2.867159e+04", - "figure_of_merit_(fom_1)": "2.370677e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.222429 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.874576 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.043813 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.104386 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "6.022492 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "22.145176 
seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 8.142642e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "22.838187 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "80.316067 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 3.006132e+04", - "figure_of_merit_(fom_1)": "2.458165e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.247462 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.864699 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.057285 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.184888 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "5.777684 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "20.600999 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 8.487657e+03" - }, - 
"solve_time": { - "pcg_solve": { - "wall_clock_time": "23.086673 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "80.998381 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 2.973776e+04", - "figure_of_merit_(fom_1)": "2.442524e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.238863 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.863483 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.060950 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.157549 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "4.881566 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "18.379813 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 1.004575e+04" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "23.461902 seconds", - "wall_mflops": "0.000000", - 
"cpu_clock_time": "80.350049 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 2.926216e+04", - "figure_of_merit_(fom_1)": "2.445806e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.148020 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.622478 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.043564 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.154993 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "6.770451 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "23.645068 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 7.243092e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "35.372723 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "97.783956 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - 
"final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 1.940891e+04", - "figure_of_merit_(fom_1)": "1.636746e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.227810 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.809670 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.069407 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.254313 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "5.671133 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "20.602727 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 8.647126e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "22.816052 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "80.114706 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 3.009048e+04", 
- "figure_of_merit_(fom_1)": "2.472964e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - "metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.185441 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.661079 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.058139 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.159175 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "5.218203 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "19.601490 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 9.397679e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "19.777390 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "72.259056 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 3.471368e+04", - "figure_of_merit_(fom_1)": "2.838468e+04" - } - } - ], - "metadata": { - "pods": 2, - "completions": 2, - 
"metricName": "app-amg", - "metricDescription": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - "metricType": "standalone", - "metricOptions": { - "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, - "workdir": "/opt/AMG" - } - }, - "spec": { - "apiVersion": "flux-framework.org/v1alpha1", - "kind": "MetricSet", - "metadata": { - "labels": { - "app.kubernetes.io/name": "metricset", - "app.kubernetes.io/instance": "metricset-sample" - }, - "name": "metricset-sample" - }, - "spec": { - "pods": 2, - "metrics": [ - { - "name": "app-amg" - } - ] - } - } - }, - { - "data": [ - { - "driver_params": { - "solver_id": "1", - "laplacian_27pt": { - "(nx,_ny,_nz)": "(10, 20, 10)", - "(px,_py,_pz)": "(1, 2, 1)" - } - }, - "generate_matrix": { - "spatial_operator": { - "wall_clock_time": "0.188481 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.613128 seconds", - "cpu_mflops": "0.000000" - } - }, - "vector_setup": { - "rhs_and_initial_guess": { - "wall_clock_time": "0.053878 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "0.190935 seconds", - "cpu_mflops": "0.000000" - } - }, - "problem_setup": { - "pcg_setup": { - "wall_clock_time": "6.171202 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "21.461641 seconds", - "cpu_mflops": "0.000000" - }, - "fom_setup": "nnz_AP / Setup Phase Time: 7.946426e+03" - }, - "solve_time": { - "pcg_solve": { - "wall_clock_time": "21.616691 seconds", - "wall_mflops": "0.000000", - "cpu_clock_time": "77.353095 seconds", - "cpu_mflops": "0.000000" - }, - "iterations": "14", - "final_relative_residual_norm": "4.643894e-09", - "fom_solve": "nnz_AP * Iterations / Solve Phase Time: 3.175999e+04", - "figure_of_merit_(fom_1)": "2.580660e+04" - } - } + {} ], "metadata": { "pods": 2, - "completions": 2, "metricName": "app-amg", "metricDescription": "parallel algebraic multigrid solver for linear systems arising from 
problems on unstructured grids", - "metricType": "standalone", "metricOptions": { "command": "amg", - "completions": 0, - "mpirun": "mpirun --hostfile ./hostlist.txt", - "rate": 10, + "prefix": "mpirun --hostfile ./hostlist.txt", "workdir": "/opt/AMG" } }, diff --git a/examples/python/io-fio/metrics.yaml b/examples/python/io-fio/metrics.yaml index 28fa3cf..9f2a54c 100644 --- a/examples/python/io-fio/metrics.yaml +++ b/examples/python/io-fio/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/python/network-netmark/metrics.yaml b/examples/python/network-netmark/metrics.yaml index 2e9aad9..b758d36 100644 --- a/examples/python/network-netmark/metrics.yaml +++ b/examples/python/network-netmark/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/python/perf-hello-world/metrics.yaml b/examples/python/perf-hello-world/metrics.yaml index b09b6d5..d050d5a 100644 --- a/examples/python/perf-hello-world/metrics.yaml +++ b/examples/python/perf-hello-world/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/python/perf-sysstat/metrics.yaml b/examples/python/perf-sysstat/metrics.yaml index bea6f76..af30f79 100644 --- a/examples/python/perf-sysstat/metrics.yaml +++ b/examples/python/perf-sysstat/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-amg/metrics.yaml b/examples/tests/app-amg/metrics.yaml index de0cc35..dbbbde2 100644 --- a/examples/tests/app-amg/metrics.yaml +++ b/examples/tests/app-amg/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: 
MetricSet metadata: labels: @@ -9,6 +9,7 @@ spec: # Number of indexed jobs to run netmark on pods: 2 metrics: + # This uses the default commands - name: app-amg diff --git a/examples/tests/app-bdas/metrics.yaml b/examples/tests/app-bdas/metrics.yaml index 8417e6c..5f0264e 100644 --- a/examples/tests/app-bdas/metrics.yaml +++ b/examples/tests/app-bdas/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-hpl/metrics.yaml b/examples/tests/app-hpl/metrics.yaml index 476817d..9a74001 100644 --- a/examples/tests/app-hpl/metrics.yaml +++ b/examples/tests/app-hpl/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-kripke/metrics.yaml b/examples/tests/app-kripke/metrics.yaml index 2450439..7b3040c 100644 --- a/examples/tests/app-kripke/metrics.yaml +++ b/examples/tests/app-kripke/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-laghos/metrics.yaml b/examples/tests/app-laghos/metrics.yaml index cd42c17..2a58e5b 100644 --- a/examples/tests/app-laghos/metrics.yaml +++ b/examples/tests/app-laghos/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-lammps/README.md b/examples/tests/app-lammps/README.md index c027280..fad2dae 100644 --- a/examples/tests/app-lammps/README.md +++ b/examples/tests/app-lammps/README.md @@ -26,7 +26,7 @@ How to see metrics operator logs: $ kubectl logs -n metrics-system metrics-controller-manager-859c66464c-7rpbw ``` -Then create the metrics set. This is going to run a single run of LAMMPS over MPI! +Then create the metrics set. 
This is going to run a single run of LAMMPS over MPI. as lammps runs. ```bash diff --git a/examples/tests/app-lammps/metrics.yaml b/examples/tests/app-lammps/metrics.yaml index 284281c..386c5b6 100644 --- a/examples/tests/app-lammps/metrics.yaml +++ b/examples/tests/app-lammps/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -10,9 +10,4 @@ spec: pods: 2 metrics: # This uses the default commands - - name: app-lammps - - # This should the default - you are responsible for asking for the right number of processes, - # lammps arguments, and calling mpirun to point at an expected hostfile in the workdir. - # options: - # command: mpirun --hostfile ./hostlist.txt -np 2 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite \ No newline at end of file + - name: app-lammps \ No newline at end of file diff --git a/examples/tests/app-ldms/metrics.yaml b/examples/tests/app-ldms/metrics.yaml index 9aae012..f704e57 100644 --- a/examples/tests/app-ldms/metrics.yaml +++ b/examples/tests/app-ldms/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-nekbone/metrics.yaml b/examples/tests/app-nekbone/metrics.yaml index 2cc1187..a160a12 100644 --- a/examples/tests/app-nekbone/metrics.yaml +++ b/examples/tests/app-nekbone/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/app-pennant/metrics.yaml b/examples/tests/app-pennant/metrics.yaml index 9efe4f5..f643628 100644 --- a/examples/tests/app-pennant/metrics.yaml +++ b/examples/tests/app-pennant/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git 
a/examples/tests/app-quicksilver/metrics.yaml b/examples/tests/app-quicksilver/metrics.yaml index 8d66a45..f8bb325 100644 --- a/examples/tests/app-quicksilver/metrics.yaml +++ b/examples/tests/app-quicksilver/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/io-fio/metrics.yaml b/examples/tests/io-fio/metrics.yaml index 80b7181..6bbe103 100644 --- a/examples/tests/io-fio/metrics.yaml +++ b/examples/tests/io-fio/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -6,19 +6,17 @@ metadata: app.kubernetes.io/instance: metricset-sample name: metricset-sample spec: - storage: - volume: - # This is the path on the host (e.g., inside kind container) - hostPath: /tmp/workflow - - # This is the path in the container - path: /tmp/workflow - metrics: - # Fio just runs once - no concept of completions / rate - name: io-fio options: size: 1M blocksize: 1K directory: /tmp/workflow + # Fio usually will have a volume as an addon, let's do hostpath here + addons: + - name: volume-hostpath + options: + name: fio-mount + hostPath: /tmp/workflow + path: /tmp/workflow \ No newline at end of file diff --git a/examples/tests/io-fio/post-run.sh b/examples/tests/io-fio/post-run.sh index f9c0beb..4d9d000 100644 --- a/examples/tests/io-fio/post-run.sh +++ b/examples/tests/io-fio/post-run.sh @@ -1,4 +1,4 @@ #!/bin/bash echo "Cleaning up /tmp/workflow in minikube" -minikube ssh -- sudo rm -rf /tmp/workflow +minikube ssh -- sudo rm -rf /tmp/workflow \ No newline at end of file diff --git a/examples/tests/io-host-volume/metrics.yaml b/examples/tests/io-host-volume/metrics.yaml index af13fce..e00c8ec 100644 --- a/examples/tests/io-host-volume/metrics.yaml +++ b/examples/tests/io-host-volume/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: 
flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -6,19 +6,15 @@ metadata: app.kubernetes.io/instance: metricset-sample name: metricset-sample spec: - storage: - volume: - # This is the path on the host (e.g., inside kind container) - hostPath: /tmp/workflow - - # This is the path in the container - path: /workflow - metrics: - name: io-sysstat options: rate: 10 completions: 2 - # Add human readable output (in a table instead of json) - # human: "true" + addons: + - name: volume-hostpath + options: + name: io-mount + hostPath: /tmp/workflow + path: /workflow diff --git a/examples/tests/io-ior/metrics.yaml b/examples/tests/io-ior/metrics.yaml index ed91683..85974ff 100644 --- a/examples/tests/io-ior/metrics.yaml +++ b/examples/tests/io-ior/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -6,17 +6,14 @@ metadata: app.kubernetes.io/instance: metricset-sample name: metricset-sample spec: - storage: - volume: - # This is the path on the host (e.g., inside kind container) - hostPath: /tmp/workflow - - # This is the path in the container - path: /tmp/workflow - metrics: - # Fio just runs once - no concept of completions / rate - name: io-ior options: workdir: /tmp/workflow + addons: + - name: volume-hostpath + options: + name: io-mount + hostPath: /tmp/workflow + path: /tmp/workflow diff --git a/examples/tests/network-chatterbug/metrics.yaml b/examples/tests/network-chatterbug/metrics.yaml index 71a6b09..f199422 100644 --- a/examples/tests/network-chatterbug/metrics.yaml +++ b/examples/tests/network-chatterbug/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -17,7 +17,7 @@ spec: command: stencil3d # Args to stencil3d args: "1 2 2 10 10 10 4 1" - sole-tenancy: "false" + soleTenancy: "false" # mpirun arguments mpirun: "-N 4" diff --git 
a/examples/tests/network-netmark/metrics.yaml b/examples/tests/network-netmark/metrics.yaml index 2ef59d0..41f2c1c 100644 --- a/examples/tests/network-netmark/metrics.yaml +++ b/examples/tests/network-netmark/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -15,3 +15,4 @@ spec: # see pkg/metrics/network/netmark.go options: tasks: 2 + soleTenancy: "false" diff --git a/examples/tests/network-osu-benchmark/metrics.yaml b/examples/tests/network-osu-benchmark/metrics.yaml index 49a762c..0066027 100644 --- a/examples/tests/network-osu-benchmark/metrics.yaml +++ b/examples/tests/network-osu-benchmark/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -12,7 +12,8 @@ spec: interactive: true metrics: - name: network-osu-benchmark - + options: + soleTenancy: "false" # Example of resource requests / limits # You should set these to ensure 1 pod : 1 node # resources: diff --git a/examples/tests/perf-hello-world/metrics.yaml b/examples/tests/perf-hello-world/metrics.yaml index 994564d..9b970b5 100644 --- a/examples/tests/perf-hello-world/metrics.yaml +++ b/examples/tests/perf-hello-world/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -6,12 +6,17 @@ metadata: app.kubernetes.io/instance: metricset-sample name: metricset-sample spec: - # We don't have support currently for bash commands with -c quotes - hence the simple commands - application: - image: ubuntu - command: sleep 10 metrics: - name: perf-sysstat options: color: "true" + # The command we are watching for + command: sleep 10 + + # The addon that runs the container with the shared process namespace + addons: + - name: application + options: + image: ubuntu + command: sleep 10 \ No newline at end of file diff --git 
a/examples/tests/perf-hpctoolkit/metrics.yaml b/examples/tests/perf-hpctoolkit/metrics.yaml index a96d827..53bc705 100644 --- a/examples/tests/perf-hpctoolkit/metrics.yaml +++ b/examples/tests/perf-hpctoolkit/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: diff --git a/examples/tests/perf-lammps-hpctoolkit/metrics-rocky.yaml b/examples/tests/perf-lammps-hpctoolkit/metrics-rocky.yaml new file mode 100644 index 0000000..92c878a --- /dev/null +++ b/examples/tests/perf-lammps-hpctoolkit/metrics-rocky.yaml @@ -0,0 +1,49 @@ +apiVersion: flux-framework.org/v1alpha2 +kind: MetricSet +metadata: + labels: + app.kubernetes.io/name: metricset + app.kubernetes.io/instance: metricset-sample + name: metricset-sample +spec: + # Number of pods for lammps (one launcher, the rest workers) + pods: 4 + logging: + interactive: true + + metrics: + + # Running more scaled lammps is our main goal + - name: app-lammps + + # How to define a custom lammps container (advanced users) + # This is for if you use rocky, not the default + image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky + + options: + command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite + workdir: /opt/lammps/examples/reaxff/HNS + + # Add on hpctoolkit, will mount a volume and wrap lammps + addons: + - name: perf-hpctoolkit + options: + mount: /opt/mnt + # Where is the event blocked / taking more time + events: "-e REALTIME@100" + + # Use a custom container here too (we have for rocky and ubuntu) + image: ghcr.io/converged-computing/metric-hpctoolkit-view:rocky + + # Don't run post analysis - script will still be generated + # postAnalysis: "false" + + # hpcrun needs to have mpirun in front of hpcrun e.g., + # mpirun hpcrun + prefix: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 4 --map-by socket + + # Ensure the working directory is consistent + workdir: /opt/lammps/examples/reaxff/HNS + + # 
Target container for entrypoint addition is the launcher, not workers + containerTarget: launcher \ No newline at end of file diff --git a/examples/tests/perf-lammps-hpctoolkit/metrics.yaml b/examples/tests/perf-lammps-hpctoolkit/metrics.yaml new file mode 100644 index 0000000..cd1b923 --- /dev/null +++ b/examples/tests/perf-lammps-hpctoolkit/metrics.yaml @@ -0,0 +1,41 @@ +apiVersion: flux-framework.org/v1alpha2 +kind: MetricSet +metadata: + labels: + app.kubernetes.io/name: metricset + app.kubernetes.io/instance: metricset-sample + name: metricset-sample +spec: + # Number of pods for lammps (one launcher, the rest workers) + pods: 4 + logging: + interactive: true + + metrics: + + # Running more scaled lammps is our main goal + - name: app-lammps + options: + command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite + workdir: /opt/lammps/examples/reaxff/HNS + + # Add on hpctoolkit, will mount a volume and wrap lammps + addons: + - name: perf-hpctoolkit + options: + mount: /opt/mnt + # Where is the event blocked / taking more time + events: "-e REALTIME@100" + + # Don't run post analysis - script will still be generated + # postAnalysis: "false" + + # hpcrun needs to have mpirun in front of hpcrun e.g., + # mpirun hpcrun + prefix: mpirun --hostfile ./hostlist.txt -np 4 --map-by socket + + # Ensure the working directory is consistent + workdir: /opt/lammps/examples/reaxff/HNS + + # Target container for entrypoint addition is the launcher, not workers + containerTarget: launcher \ No newline at end of file diff --git a/examples/tests/perf-lammps/metrics.yaml b/examples/tests/perf-lammps/metrics.yaml index 7003330..7c0e35a 100644 --- a/examples/tests/perf-lammps/metrics.yaml +++ b/examples/tests/perf-lammps/metrics.yaml @@ -1,4 +1,4 @@ -apiVersion: flux-framework.org/v1alpha1 +apiVersion: flux-framework.org/v1alpha2 kind: MetricSet metadata: labels: @@ -6,8 +6,13 @@ metadata: app.kubernetes.io/instance: metricset-sample name: metricset-sample spec: - application: - 
image: ghcr.io/rse-ops/vanilla-lammps:tag-latest - command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite metrics: - name: perf-sysstat + options: + command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite + addons: + - name: application + options: + image: ghcr.io/rse-ops/vanilla-lammps:tag-latest + command: mpirun lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite + diff --git a/hack/addons-gen/main.go b/hack/addons-gen/main.go new file mode 100644 index 0000000..cdc34ac --- /dev/null +++ b/hack/addons-gen/main.go @@ -0,0 +1,53 @@ +package main + +import ( + "encoding/json" + "log" + "os" + "sort" + + "github.com/converged-computing/metrics-operator/pkg/addons" + // Addons are registered here! Importing registers once + // + // +kubebuilder:scaffold:imports +) + +var ( + baseurl = "https://converged-computing.github.io/metrics-operator/getting_started/addons.html" +) + +type AddonOutput struct { + Name string `json:"name"` + Description string `json:"description"` + Family string `json:"family"` +} + +func main() { + if len(os.Args) <= 1 { + log.Fatal("Please provide a filename to write to") + } + filename := os.Args[1] + records := []AddonOutput{} + for _, addon := range addons.Registry { + newRecord := AddonOutput{ + Name: addon.Name(), + Description: addon.Description(), + Family: addon.Family(), + } + records = append(records, newRecord) + } + + // Ensure we are consistent in ordering + sort.Slice(records, func(i, j int) bool { + return records[i].Name < records[j].Name + }) + + file, err := json.MarshalIndent(records, "", " ") + if err != nil { + log.Fatalf("Could not marshal records: %s\n", err.Error()) + } + err = os.WriteFile(filename, file, 0644) + if err != nil { + log.Fatalf("Could not write to file %s: %s\n", filename, err.Error()) + } +} diff --git a/hack/docs-gen/main.go b/hack/metrics-gen/main.go similarity index 94% rename from hack/docs-gen/main.go rename to hack/metrics-gen/main.go index ad79dd6..0691306 100644 ---
a/hack/docs-gen/main.go +++ b/hack/metrics-gen/main.go @@ -12,7 +12,8 @@ import ( _ "github.com/converged-computing/metrics-operator/pkg/metrics/io" _ "github.com/converged-computing/metrics-operator/pkg/metrics/network" _ "github.com/converged-computing/metrics-operator/pkg/metrics/perf" - //+kubebuilder:scaffold:imports + // + // +kubebuilder:scaffold:imports ) var ( @@ -23,7 +24,6 @@ type MetricOutput struct { Name string `json:"name"` Description string `json:"description"` Family string `json:"family"` - Type string `json:"type"` Image string `json:"image"` Url string `json:"url"` } @@ -39,7 +39,6 @@ func main() { Name: metric.Name(), Description: metric.Description(), Family: metric.Family(), - Type: metric.Type(), Image: metric.Image(), Url: metric.Url(), } diff --git a/main.go b/main.go index 7a67998..4de8780 100644 --- a/main.go +++ b/main.go @@ -27,7 +27,7 @@ import ( "sigs.k8s.io/controller-runtime/pkg/log/zap" jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" controllers "github.com/converged-computing/metrics-operator/controllers/metric" // Metrics are registered here! Importing registers once @@ -35,7 +35,8 @@ import ( _ "github.com/converged-computing/metrics-operator/pkg/metrics/io" _ "github.com/converged-computing/metrics-operator/pkg/metrics/network" _ "github.com/converged-computing/metrics-operator/pkg/metrics/perf" - //+kubebuilder:scaffold:imports + // + // +kubebuilder:scaffold:imports ) var ( diff --git a/pkg/addons/addons.go b/pkg/addons/addons.go new file mode 100644 index 0000000..95a0cb7 --- /dev/null +++ b/pkg/addons/addons.go @@ -0,0 +1,132 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. 
AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package addons + +import ( + "fmt" + "log" + "reflect" + + jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" + + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" + "k8s.io/apimachinery/pkg/util/intstr" +) + +// An addon can support adding volumes, containers, or otherwise customizing the jobset. + +var ( + Registry = make(map[string]Addon) + AddonFamilyPerformance = "performance" + AddonFamilyVolume = "volume" + AddonFamilyApplication = "application" +) + +// A general metric is a container added to a JobSet +type Addon interface { + + // Metadata + Name() string + Family() string + Description() string + + // Options and exportable attributes + SetOptions(*api.MetricAddon) + Options() map[string]intstr.IntOrString + ListOptions() map[string][]intstr.IntOrString + MapOptions() map[string]map[string]intstr.IntOrString + + // What addons can control: + AssembleVolumes() []specs.VolumeSpec + AssembleContainers() []specs.ContainerSpec + CustomizeEntrypoints([]*specs.ContainerSpec, []*jobset.ReplicatedJob) + + // Instead of exposing individual pieces (volumes, settings, etc) + // We simply allow it to modify the job + // Attributes for JobSet, etc. 
+ Validate() bool +} + +// Shared base of metadata and functions +type AddonBase struct { + Identifier string + Url string + Summary string + Family string + + options map[string]intstr.IntOrString + listOptions map[string][]intstr.IntOrString + mapOptions map[string]map[string]intstr.IntOrString +} + +func (b AddonBase) SetOptions(metric *api.MetricAddon) {} +func (b AddonBase) CustomizeEntrypoints([]*specs.ContainerSpec, []*jobset.ReplicatedJob) {} + +func (b AddonBase) Validate() bool { + return true +} +func (b AddonBase) AssembleContainers() []specs.ContainerSpec { + return []specs.ContainerSpec{} +} + +// AssembleVolumes (for now) returns no volumes by default +func (b AddonBase) AssembleVolumes() []specs.VolumeSpec { + return []specs.VolumeSpec{} +} + +func (b AddonBase) Description() string { + return b.Summary +} +func (b AddonBase) Name() string { + return b.Identifier +} +func (b AddonBase) Options() map[string]intstr.IntOrString { + return b.options +} +func (b AddonBase) ListOptions() map[string][]intstr.IntOrString { + return b.listOptions +} +func (b AddonBase) MapOptions() map[string]map[string]intstr.IntOrString { + return b.mapOptions +} + +// GetAddon looks up and validates an addon +func GetAddon(a *api.MetricAddon) (Addon, error) { + + // We don't want to change the addon interface/struct itself + template, ok := Registry[a.Name] + if !ok { + return nil, fmt.Errorf("%s is not a known addon", a.Name) + } + templateType := reflect.ValueOf(template) + if templateType.Kind() == reflect.Ptr { + templateType = reflect.Indirect(templateType) + } + addon := reflect.New(templateType.Type()).Interface().(Addon) + + // Set options before validation + addon.SetOptions(a) + + // Validate the addon + if !addon.Validate() { + return nil, fmt.Errorf("Addon %s did not validate", a.Name) + } + return addon, nil +} + +// TODO likely we need to carry around entrypoints to customize? + +// Register a new addon!
+func Register(a Addon) { + name := a.Name() + if _, ok := Registry[name]; ok { + log.Fatalf("Addon: %s has already been added to the addon registry", name) + } + Registry[name] = a +} diff --git a/pkg/addons/containers.go b/pkg/addons/containers.go new file mode 100644 index 0000000..b933290 --- /dev/null +++ b/pkg/addons/containers.go @@ -0,0 +1,192 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package addons + +import ( + "fmt" + "strings" + + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" + "k8s.io/apimachinery/pkg/util/intstr" +) + +// Container addons are typically for applications +type ApplicationAddon struct { + AddonBase + + // Container image + image string + + // Name for container + name string + + // command to execute + command string + + // Working Directory + workdir string + + // Entrypoint of container, if different from command + entrypoint string + + // A pull secret for the application container + pullSecret string + + // Resources include limits and requests for the application + resources map[string]map[string]intstr.IntOrString + + // Container Spec has attributes for the container + // Do we run this in privileged mode? 
+ privileged bool +} + +// Validate we have an executable provided, and args and optional +func (a *ApplicationAddon) Validate() bool { + if a.image == "" { + logger.Error("The application addon requires a container 'image'.") + return false + } + if a.name == "" { + a.name = "app-addon" + } + if a.command == "" { + logger.Error("The application addon requires a container 'command'.") + return false + } + return true +} + +// AssembleContainers adds the addon application container +func (a ApplicationAddon) AssembleContainers() []specs.ContainerSpec { + return []specs.ContainerSpec{{ + Image: a.image, + Name: a.name, + WorkingDir: a.workdir, + Command: strings.Split(a.command, " "), + // TODO these need to be mapped from m.resources + Resources: &api.ContainerResources{}, + Attributes: &api.ContainerSpec{ + SecurityContext: api.SecurityContext{ + Privileged: a.privileged, + // TODO add the caps here ptrace admin + }, + }, + }} +} + +func (m ApplicationAddon) Family() string { + return AddonFamilyApplication +} + +// Set custom options / attributes for the metric +func (a *ApplicationAddon) SetDefaultOptions(metric *api.MetricAddon) { + a.resources = map[string]map[string]intstr.IntOrString{} + + image, ok := metric.Options["image"] + if ok { + a.image = image.StrVal + } + command, ok := metric.Options["command"] + if ok { + a.command = command.StrVal + } + entrypoint, ok := metric.Options["entrypoint"] + if ok { + a.entrypoint = entrypoint.StrVal + } + pullSecret, ok := metric.Options["pullSecret"] + if ok { + a.pullSecret = pullSecret.StrVal + } + workdir, ok := metric.Options["workdir"] + if ok { + a.workdir = workdir.StrVal + } + priv, ok := metric.Options["privileged"] + if ok { + if priv.StrVal == "true" || priv.StrVal == "yes" { + a.privileged = true + } + } + resources, ok := metric.MapOptions["resourceLimits"] + if ok { + a.resources["limits"] = map[string]intstr.IntOrString{} + for key, value := range resources { + a.resources["limits"][key] = value + } + 
} + resources, ok = metric.MapOptions["resourceRequests"] + if ok { + a.resources["requests"] = map[string]intstr.IntOrString{} + for key, value := range resources { + a.resources["requests"][key] = value + } + } + if a.entrypoint == "" { + a.setDefaultEntrypoint() + } +} + +// Set the default entrypoint +func (a *ApplicationAddon) setDefaultEntrypoint() { + a.entrypoint = fmt.Sprintf("/metrics_operator/%s-entrypoint.sh", a.Identifier) +} + +// Calling the default allows a custom application that uses this to do the same +func (a *ApplicationAddon) SetOptions(metric *api.MetricAddon) { + a.SetDefaultOptions(metric) +} + +// Underlying function that can be shared +func (a *ApplicationAddon) DefaultOptions() map[string]intstr.IntOrString { + values := map[string]intstr.IntOrString{ + "image": intstr.FromString(a.image), + "workdir": intstr.FromString(a.workdir), + "entrypoint": intstr.FromString(a.entrypoint), + "command": intstr.FromString(a.command), + } + if a.privileged { + values["privileged"] = intstr.FromString("true") + } else { + values["privileged"] = intstr.FromString("false") + } + return values +} + +// Exported options and list options +func (a *ApplicationAddon) Options() map[string]intstr.IntOrString { + return a.DefaultOptions() +} + +// Return formatted map options +func (a *ApplicationAddon) MapOptions() map[string]map[string]intstr.IntOrString { + requests := map[string]intstr.IntOrString{} + limits := map[string]intstr.IntOrString{} + for k, value := range a.resources["limits"] { + limits[k] = value + } + for k, value := range a.resources["requests"] { + requests[k] = value + } + return map[string]map[string]intstr.IntOrString{ + "resourceLimits": limits, + "resourceRequests": requests, + } +} + +func init() { + + // Config map volume type + base := AddonBase{ + Identifier: "application", + Summary: "basic application (container) type", + } + app := ApplicationAddon{AddonBase: base} + Register(&app) +} diff --git a/pkg/addons/hpctoolkit.go 
b/pkg/addons/hpctoolkit.go new file mode 100644 index 0000000..d8c539e --- /dev/null +++ b/pkg/addons/hpctoolkit.go @@ -0,0 +1,408 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package addons + +import ( + "fmt" + "path/filepath" + "strings" + + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/metadata" + "github.com/converged-computing/metrics-operator/pkg/specs" + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/util/intstr" + jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" +) + +// HPCToolkit is an addon that provides a container to collect performance metrics +// Commands to interact with output data +// hpcstruct hpctoolkit-sleep-measurements +// hpcprof hpctoolkit-sleep-measurements +// hpcviewer ./hpctoolkit-lmp-database + +type HPCToolkit struct { + ApplicationAddon + + // Target is the name of the replicated job to customize entrypoint logic for + target string + + // Output files + // This is the main output file, and then the database is this + -database + output string + + // Run a post analysis with hpcstruct and hpcprof to generate a database + postAnalysis bool + + // ContainerTarget is the name of the container to add the entrypoint logic to + containerTarget string + events string + mount string + entrypointPath string + volumeName string + + // For mpirun and similar, mpirun needs to wrap hpcrun and the command, e.g., + // mpirun hpcrun + prefix string +} + +func (m HPCToolkit) Family() string { + return AddonFamilyPerformance +} + +// AssembleVolumes to provide an empty volume for the application to share +// We also need to provide a config map volume for our container spec +func (m HPCToolkit) AssembleVolumes() []specs.VolumeSpec { + volume := corev1.Volume{ + Name: m.volumeName, + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{}, + }, + } + + 
// Prepare items as key to path + items := []corev1.KeyToPath{ + { + Key: m.volumeName, + Path: filepath.Base(m.entrypointPath), + }, + } + + // This is a config map volume with items + // It needs to be created in the same metrics operator namespace + // Thus we only need the items! + configVolume := corev1.Volume{ + VolumeSource: corev1.VolumeSource{ + ConfigMap: &corev1.ConfigMapVolumeSource{ + Items: items, + }, + }, + } + + // EmptyDir should be ReadOnly False, and we don't need a mount for it + return []specs.VolumeSpec{ + { + Volume: volume, + Mount: true, + Path: m.mount, + }, + + // Mount is set to false here because we mount via metrics_operator + { + Volume: configVolume, + ReadOnly: true, + Mount: false, + Path: filepath.Dir(m.entrypointPath), + }, + } +} + +// Validate we have an executable provided, and args and optional +func (a *HPCToolkit) Validate() bool { + if a.events == "" { + logger.Error("The HPCtoolkit application addon requires one or more 'events' for hpcrun (e.g., -e IO).") + return false + } + return true +} + +// Set custom options / attributes for the metric +func (a *HPCToolkit) SetOptions(metric *api.MetricAddon) { + + a.entrypointPath = "/metrics_operator/hpctoolkit-entrypoint.sh" + a.image = "ghcr.io/converged-computing/metric-hpctoolkit-view:ubuntu" + a.SetDefaultOptions(metric) + a.mount = "/opt/share" + a.volumeName = "hpctoolkit" + a.output = "hpctoolkit-result" + a.postAnalysis = true + + // UseColor set to anything means to use it + output, ok := metric.Options["output"] + if ok { + a.output = output.StrVal + } + mount, ok := metric.Options["mount"] + if ok { + a.mount = mount.StrVal + } + prefix, ok := metric.Options["prefix"] + if ok { + a.prefix = prefix.StrVal + } + workdir, ok := metric.Options["workdir"] + if ok { + a.workdir = workdir.StrVal + } + target, ok := metric.Options["target"] + if ok { + a.target = target.StrVal + } + ctarget, ok := metric.Options["containerTarget"] + if ok { + a.containerTarget = 
ctarget.StrVal + } + events, ok := metric.Options["events"] + if ok { + a.events = events.StrVal + } + image, ok := metric.Options["image"] + if ok { + a.image = image.StrVal + } + // This will work via a ssh command + postAnalysis, ok := metric.Options["postAnalysis"] + if ok { + if postAnalysis.StrVal == "no" || postAnalysis.StrVal == "false" { + a.postAnalysis = false + } + } +} + +// Exported options and list options +func (a *HPCToolkit) Options() map[string]intstr.IntOrString { + options := a.DefaultOptions() + options["events"] = intstr.FromString(a.events) + options["mount"] = intstr.FromString(a.mount) + options["prefix"] = intstr.FromString(a.prefix) + return options +} + +// CustomizeEntrypoint scripts +func (a *HPCToolkit) CustomizeEntrypoints( + cs []*specs.ContainerSpec, + rjs []*jobset.ReplicatedJob, +) { + for _, rj := range rjs { + + // Only customize if the replicated job name matches the target + if a.target != "" && a.target != rj.Name { + continue + } + a.customizeEntrypoint(cs, rj) + } + +} + +// CustomizeEntrypoint for a single replicated job +func (a *HPCToolkit) customizeEntrypoint( + cs []*specs.ContainerSpec, + rj *jobset.ReplicatedJob, +) { + + // Generate addon metadata + meta := Metadata(a) + + // This should be run after the pre block of the script + preBlock := ` +echo "%s" +# Ensure hpcrun and software exists. 
This is rough, but should be OK with enough wait time +wget https://github.com/converged-computing/goshare/releases/download/2023-09-06/wait-fs +chmod +x ./wait-fs +mv ./wait-fs /usr/bin/goshare-wait-fs + +# Ensure spack view is on the path, wherever it is mounted +viewbase="%s" +software="${viewbase}/software" +viewbin="${viewbase}/view/bin" +hpcrunpath=${viewbin}/hpcrun + +# Important to add AFTER in case software in container duplicated +export PATH=$PATH:${viewbin} + +# Wait for software directory, and give it time +goshare-wait-fs -p ${software} + +# Wait for copy to finish +sleep 10 + +# Copy mount software to /opt/software +cp -R %s/software /opt/software + +# Wait for hpcrun and marker to indicate copy is done +goshare-wait-fs -p ${viewbin}/hpcrun +goshare-wait-fs -p ${viewbase}/metrics-operator-done.txt + +# A small extra wait time to be conservative +sleep 5 + +# This will work with capability SYS_ADMIN added. +# It will only work with privileged set to true AT YOUR OWN RISK! +echo "-1" | tee /proc/sys/kernel/perf_event_paranoid + +# The output path for the analysis +output="%s" + +# Run hpcrun. 
See options with hpcrun -L +events="%s" + +# Write a script to run for the post block analysis +here=$(pwd) +cat <<EOF > ./post-run.sh +#!/bin/bash +# Input path should be consistent between nodes +cd ${here} +${viewbin}/hpcstruct ${output} +${viewbin}/hpcprof -o ${output}-database ${output} +EOF +chmod +x ./post-run.sh + +echo "%s" +echo "%s" +` + preBlock = fmt.Sprintf( + preBlock, + meta, + a.mount, + a.mount, + a.output, + a.events, + metadata.CollectionStart, + metadata.Separator, + ) + + // postBlock to possibly run the hpcstruct command should come right after + postBlock := "" + if a.postAnalysis { + postBlock = ` +for host in $(cat ./hostlist.txt); do + echo "Running post analysis for host ${host}" + if [[ "$host" == "$(hostname)" ]]; then + bash ./post-run.sh + else + ssh ${host} ${workdir}/post-run.sh + fi +done +echo "METRICS-OPERATOR HPCTOOLKIT Post analysis done." +` + } + + // Add the working directory, if defined + if a.workdir != "" { + preBlock += fmt.Sprintf(` +workdir="%s" +echo "Changing directory to ${workdir}" +cd ${workdir} +`, a.workdir) + } + + // We use container names to target specific entrypoint scripts here + for _, containerSpec := range cs { + + // First check - is this the right replicated job?
+ if containerSpec.JobName != rj.Name { + continue + } + + // Always copy over the pre block - we need the logic to copy software + containerSpec.EntrypointScript.Pre += "\n" + preBlock + + // Next check if we have a target set (for the container) + if a.containerTarget != "" && containerSpec.Name != "" && a.containerTarget != containerSpec.Name { + continue + } + + // If the post command ends with sleep infinity, tweak it + isInteractive, updatedPost := deriveUpdatedPost(containerSpec.EntrypointScript.Post) + containerSpec.EntrypointScript.Post = updatedPost + + // The post to run the command across nodes (when the application finishes) + containerSpec.EntrypointScript.Post = containerSpec.EntrypointScript.Post + "\n" + postBlock + containerSpec.EntrypointScript.Command = fmt.Sprintf("%s $hpcrunpath -o $output $events %s", a.prefix, containerSpec.EntrypointScript.Command) + + // If is interactive, add back sleep infinity + if isInteractive { + containerSpec.EntrypointScript.Post += "\nsleep infinity\n" + } + } +} + +// update a post command to not end in sleep +func deriveUpdatedPost(post string) (bool, string) { + if strings.HasSuffix(post, "sleep infinity\n") { + updated := strings.Split(post, "\n") + // This is actually two lines + updated = updated[:len(updated)-2] + return true, strings.Join(updated, "\n") + } + return false, post +} + +// Generate a container spec that will map to a listing of containers for the replicated job +func (a *HPCToolkit) AssembleContainers() []specs.ContainerSpec { + + // The entrypoint script + // This is the addon container entrypoint, we don't care about metadata here + // The sole purpose is just to provide the volume, meaning copying content there + template := `#!/bin/bash + +echo "Moving content from /opt/view to be in shared volume at %s" +view=$(ls /opt/views/._view/) +view="/opt/views/._view/${view}" + +# Give a little extra wait time +sleep 10 + +viewroot="%s" +mkdir -p $viewroot/view +# We have to move both of these 
paths, *sigh* +cp -R ${view}/* $viewroot/view +cp -R /opt/software $viewroot/ + +# This is a marker to indicate the copy is done +touch $viewroot/metrics-operator-done.txt + +# Sleep forever, the application needs to run and end +echo "Sleeping forever so %s can be shared and used for hpctoolkit." +sleep infinity +` + script := fmt.Sprintf( + template, + a.mount, + a.mount, + a.mount, + ) + + // Leave the name empty to generate in the namespace of the metric set (e.g., set.Name) + entrypoint := specs.EntrypointScript{ + Name: a.volumeName, + Path: a.entrypointPath, + Script: filepath.Base(a.entrypointPath), + Pre: script, + } + + // The resource spec and attributes for now are empty (might redo this design) + // We assume they inherit the resources / attributes of the pod for now + // We don't use JobName here because we don't associate addon containers + // with other addon entrypoints + return []specs.ContainerSpec{ + { + Image: a.image, + Name: "hpctoolkit", + EntrypointScript: entrypoint, + Resources: &api.ContainerResources{}, + Attributes: &api.ContainerSpec{ + SecurityContext: api.SecurityContext{ + Privileged: a.privileged, + }, + }, + // We need to write this config map! + NeedsWrite: true, + }, + } +} + +func init() { + base := AddonBase{ + Identifier: "perf-hpctoolkit", + Summary: "performance tools for measurement and analysis", + } + app := ApplicationAddon{AddonBase: base} + HPCToolkit := HPCToolkit{ApplicationAddon: app} + Register(&HPCToolkit) +} diff --git a/pkg/addons/logs.go b/pkg/addons/logs.go new file mode 100644 index 0000000..e35ef6a --- /dev/null +++ b/pkg/addons/logs.go @@ -0,0 +1,51 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f.
AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package addons + +import ( + "encoding/json" + "fmt" + "log" + + "github.com/converged-computing/metrics-operator/pkg/metadata" + "github.com/converged-computing/metrics-operator/pkg/utils" + "go.uber.org/zap" +) + +// Consistent logging identifiers that should be echoed to have newline after +var ( + logger *zap.SugaredLogger +) + +// Default metadata (in JSON) to also put at the top of addons +// That append to an entrypoint with their metadata +func Metadata(a Addon) string { + + export := metadata.MetricExport{ + MetricName: a.Name(), + MetricDescription: a.Description(), + MetricOptions: a.Options(), + MetricListOptions: a.ListOptions(), + } + meta, err := json.Marshal(export) + if err != nil { + logger.Errorf("Warning, error serializing spec metadata: %s", err.Error()) + } + // We need to escape the quotes for printing in bash + metadataEscaped := utils.EscapeCharacters(string(meta)) + return fmt.Sprintf("ADDON METADATA START %s\nADDON METADATA END", metadataEscaped) +} + +func init() { + handle, err := zap.NewProduction() + if err != nil { + log.Fatalf("can't initialize zap logger: %v", err) + } + logger = handle.Sugar() + defer handle.Sync() +} diff --git a/pkg/addons/volumes.go b/pkg/addons/volumes.go new file mode 100644 index 0000000..cbad87f --- /dev/null +++ b/pkg/addons/volumes.go @@ -0,0 +1,393 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. 
AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package addons + +import ( + "fmt" + "math/rand" + "path/filepath" + + corev1 "k8s.io/api/core/v1" + + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" + "k8s.io/apimachinery/pkg/util/intstr" +) + +type VolumeBase struct { + AddonBase + readOnly bool + name string + path string +} + +func (m VolumeBase) Family() string { + return AddonFamilyVolume +} + +func (v *VolumeBase) DefaultValidate() bool { + + // We require the user to provide a name to ensure they enforce uniqueness + if v.name == "" { + logger.Error("🟥️ All volume addons require a 'name' for a unique container mount.") + return false + } + if v.path == "" { + logger.Error("🟥️ All volume addons require a 'path' for the container mount.") + return false + } + return true +} + +// generateName appends a random number to the volume name +func (v *VolumeBase) generateName() string { + number := rand.Intn(10000) + return fmt.Sprintf("%s-%d", v.name, number) +} + +// DefaultSetOptions across volume types for shared attributes +func (v *VolumeBase) DefaultSetOptions(metric *api.MetricAddon) { + + // Shared name, path, and readOnly options + name, ok := metric.Options["name"] + if ok { + v.name = name.StrVal + } + path, ok := metric.Options["path"] + if ok { + v.path = path.StrVal + } + readOnly, ok := metric.Options["readOnly"] + if ok { + if readOnly.StrVal == "yes" || readOnly.StrVal == "true" { + v.readOnly = true + } + } +} + +// ConfigMapVolume mounts an existing config map as a volume +type ConfigMapVolume struct { + VolumeBase + + // Config map name is required for an existing config map + // The metrics operator does not create it for you! 
+ configMapName string + + // Items (key and paths) for the config map + items map[string]string +} + +// Validate that a config map name and at least one item are provided +func (v *ConfigMapVolume) Validate() bool { + if v.configMapName == "" { + logger.Error("🟥️ The volume-cm volume addon requires a 'configMapName' for the existing config map.") + return false + } + if len(v.items) == 0 { + logger.Error("🟥️ The volume-cm volume addon requires at least one entry in mapOptions->items, with key value pairs.") + return false + } + return v.DefaultValidate() +} + +// Set custom options / attributes for the addon +func (v *ConfigMapVolume) SetOptions(metric *api.MetricAddon) { + + // Start with an empty map of items + v.items = map[string]string{} + + name, ok := metric.Options["configMapName"] + if ok { + v.configMapName = name.StrVal + } + + // Items for the config map + items, ok := metric.MapOptions["items"] + if ok { + for k, value := range items { + v.items[k] = value.StrVal + } + } + v.DefaultSetOptions(metric) +} + +// Exported options and list options +func (v *ConfigMapVolume) Options() map[string]intstr.IntOrString { + return map[string]intstr.IntOrString{ + "path": intstr.FromString(v.path), + "name": intstr.FromString(v.name), + "configMapName": intstr.FromString(v.configMapName), + } +} + +// Return formatted map options +func (v *ConfigMapVolume) MapOptions() map[string]map[string]intstr.IntOrString { + items := map[string]intstr.IntOrString{} + for k, value := range v.items { + items[k] = intstr.FromString(value) + } + return map[string]map[string]intstr.IntOrString{ + "items": items, + } +} + +// AssembleVolumes for a config map +func (v *ConfigMapVolume) AssembleVolumes() []specs.VolumeSpec { + + // Prepare items as key to path + items := []corev1.KeyToPath{} + for key, path := range v.items { + newItem := corev1.KeyToPath{ + Key: key, + Path: path, + } + items = append(items, newItem) + } + + // This is a config map volume with items + newVolume := 
corev1.Volume{ + Name: v.name, + VolumeSource: corev1.VolumeSource{ + ConfigMap: &corev1.ConfigMapVolumeSource{ + LocalObjectReference: corev1.LocalObjectReference{ + Name: v.configMapName, + }, + Items: items, + }, + }, + } + + // ConfigMaps have to be read only! + return []specs.VolumeSpec{{ + Volume: newVolume, + Path: filepath.Dir(v.path), + ReadOnly: true, + Mount: true, + }} +} + +// An existing persistent volume claim +type PersistentVolumeClaim struct { + VolumeBase + + // Path and claim name are always required if a secret isn't defined + claimName string +} + +// Validate that a claim name is provided +func (v *PersistentVolumeClaim) Validate() bool { + if v.claimName == "" { + logger.Error("🟥️ The volume-pvc volume addon requires a 'claimName' for the existing persistent volume claim (pvc).") + return false + } + return v.DefaultValidate() +} + +// Set custom options / attributes +func (v *PersistentVolumeClaim) SetOptions(metric *api.MetricAddon) { + claimName, ok := metric.Options["claimName"] + if ok { + v.claimName = claimName.StrVal + } + v.DefaultSetOptions(metric) +} + +// AssembleVolumes for a pvc +func (v *PersistentVolumeClaim) AssembleVolumes() []specs.VolumeSpec { + volume := corev1.Volume{ + Name: v.name, + VolumeSource: corev1.VolumeSource{ + PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ + ClaimName: v.claimName, + }, + }, + } + + // Unlike a config map, a claim uses the readOnly option 
+ return []specs.VolumeSpec{{ + Volume: volume, + Path: filepath.Dir(v.path), + ReadOnly: v.readOnly, + Mount: true, + }} +} + +// An existing secret +type SecretVolume struct { + VolumeBase + + // secret name is required + secretName string +} + +// Validate that a secret name is provided +func (v *SecretVolume) Validate() bool { + if v.secretName == "" { + logger.Error("🟥️ The volume-secret addon requires a 'secretName' for the existing secret.") + return false + } + return v.DefaultValidate() +} + +// Set custom options / attributes +func (v *SecretVolume) SetOptions(metric *api.MetricAddon) { + secretName, ok := metric.Options["secretName"] + if ok { + v.secretName = secretName.StrVal + } + v.DefaultSetOptions(metric) +} + +// AssembleVolumes for a Secret +func (v *SecretVolume) AssembleVolumes() []specs.VolumeSpec { + volume := corev1.Volume{ + Name: v.name, + VolumeSource: corev1.VolumeSource{ + Secret: &corev1.SecretVolumeSource{ + SecretName: v.secretName, + }, + }, + } + return []specs.VolumeSpec{{ + Volume: volume, + ReadOnly: v.readOnly, + Path: v.path, + Mount: true, + }} +} + +// A hostPath volume +type HostPathVolume struct { + VolumeBase + + // only the hostpath and name are required + hostPath string +} + +// Validate that a host path is provided +func (v *HostPathVolume) Validate() bool { + if v.hostPath == "" { + logger.Error("🟥️ The volume-hostpath addon requires a 'hostPath' for the host path.") + return false + } + return v.DefaultValidate() +} + +// Set custom options / attributes +func (v *HostPathVolume) SetOptions(metric *api.MetricAddon) { + + // The hostPath option is required! 
+ path, ok := metric.Options["hostPath"] + if ok { + v.hostPath = path.StrVal + } + v.DefaultSetOptions(metric) +} + +// AssembleVolumes for a host volume +func (v *HostPathVolume) AssembleVolumes() []specs.VolumeSpec { + volume := corev1.Volume{ + Name: v.name, + VolumeSource: corev1.VolumeSource{ + HostPath: &corev1.HostPathVolumeSource{ + Path: v.hostPath, + }, + }, + } + return []specs.VolumeSpec{{ + Volume: volume, + Mount: true, + Path: v.path, + ReadOnly: v.readOnly, + }} +} + +// An empty volume needs no source, only the shared name and path +type EmptyVolume struct { + VolumeBase +} + +// Validate only the shared name and path +func (v *EmptyVolume) Validate() bool { + return v.DefaultValidate() +} + +// Set custom options / attributes (only 'name' is read here, so 'path' stays empty) +func (v *EmptyVolume) SetOptions(metric *api.MetricAddon) { + name, ok := metric.Options["name"] + if ok { + v.name = name.StrVal + } +} + +// AssembleVolumes for an empty volume +func (v *EmptyVolume) AssembleVolumes() []specs.VolumeSpec { + volume := corev1.Volume{ + Name: v.name, + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{}, + }, + } + return []specs.VolumeSpec{{ + Volume: volume, + Mount: true, + Path: v.path, + ReadOnly: v.readOnly, + }} +} + +// TODO likely we need to carry around entrypoints to customize? 
+ +func init() { + + // Config map volume type + base := AddonBase{ + Identifier: "volume-cm", + Summary: "config map volume type", + } + volBase := VolumeBase{AddonBase: base} + vol := ConfigMapVolume{VolumeBase: volBase} + Register(&vol) + + // Secret volume type + base = AddonBase{ + Identifier: "volume-secret", + Summary: "secret volume type", + } + volBase = VolumeBase{AddonBase: base} + secretVol := SecretVolume{VolumeBase: volBase} + Register(&secretVol) + + // Hostpath volume type + base = AddonBase{ + Identifier: "volume-hostpath", + Summary: "host path volume type", + } + volBase = VolumeBase{AddonBase: base} + hostVol := HostPathVolume{VolumeBase: volBase} + Register(&hostVol) + + // persistent volume claim volume type + base = AddonBase{ + Identifier: "volume-pvc", + Summary: "persistent volume claim volume type", + } + volBase = VolumeBase{AddonBase: base} + pvcVol := PersistentVolumeClaim{VolumeBase: volBase} + Register(&pvcVol) + + // EmptyVolume + base = AddonBase{ + Identifier: "volume-empty", + Summary: "empty volume type", + } + volBase = VolumeBase{AddonBase: base} + emptyVol := EmptyVolume{VolumeBase: volBase} + Register(&emptyVol) + +} diff --git a/pkg/jobs/application.go b/pkg/jobs/application.go deleted file mode 100644 index 45b5b52..0000000 --- a/pkg/jobs/application.go +++ /dev/null @@ -1,115 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National Security, LLC - (c.f. 
AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package jobs - -import ( - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - metrics "github.com/converged-computing/metrics-operator/pkg/metrics" - "k8s.io/apimachinery/pkg/util/intstr" - jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" -) - -// These are common templates for application metrics - -// SingleApplication is a Metric base for a simple application metric -// be accessible by other packages (and not conflict with function names) -type SingleApplication struct { - Identifier string - Summary string - Container string - Workdir string - ResourceSpec *api.ContainerResources - AttributeSpec *api.ContainerSpec -} - -// Name returns the metric name -func (m SingleApplication) Name() string { - return m.Identifier -} - -func (m SingleApplication) GetVolumes() map[string]api.Volume { - return map[string]api.Volume{} -} - -func (m SingleApplication) HasSoleTenancy() bool { - return false -} - -// Description returns the metric description -func (m SingleApplication) Description() string { - return m.Summary -} - -// Default SingleApplication is generic performance family -func (m SingleApplication) Family() string { - return metrics.PerformanceFamily -} - -// Return container resources for the metric container -func (m SingleApplication) Resources() *api.ContainerResources { - return m.ResourceSpec -} -func (m SingleApplication) Attributes() *api.ContainerSpec { - return m.AttributeSpec -} - -// Validation -func (m SingleApplication) Validate(spec *api.MetricSet) bool { - return true -} - -// If we have an application container, return that plus custom logic -// custom: is any custom code (environment, waiting, etc.) 
-// prefix: is a wrapper to the actual entrypoint command -func (m SingleApplication) ApplicationEntrypoint( - spec *api.MetricSet, - custom string, - prefix string, - suffix string, -) metrics.EntrypointScript { - - template := `#!/bin/bash` - - // If we have custom logic (environment, sleep, etc) add it here - if custom != "" { - template = template + "\n" + custom - } - // Add the actual entrypoint - template = template + "\n" + prefix + " " + spec.Spec.Application.Entrypoint + "\n" + suffix - - // If we do, add the custom logic first - return metrics.EntrypointScript{ - Script: template, - Path: metrics.DefaultApplicationEntrypoint, - Name: metrics.DefaultApplicationName, - } -} - -// Container variables -func (m SingleApplication) Image() string { - return m.Container -} -func (m SingleApplication) WorkingDir() string { - return m.Workdir -} - -func (m SingleApplication) ReplicatedJobs(spec *api.MetricSet) ([]jobset.ReplicatedJob, error) { - return []jobset.ReplicatedJob{}, nil -} - -func (m SingleApplication) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} -} - -func (m SingleApplication) SuccessJobs() []string { - return []string{} -} - -func (m SingleApplication) Type() string { - return metrics.ApplicationMetric -} diff --git a/pkg/jobs/launcher.go b/pkg/jobs/launcher.go deleted file mode 100644 index f18fc2c..0000000 --- a/pkg/jobs/launcher.go +++ /dev/null @@ -1,328 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National Security, LLC - (c.f. AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package jobs - -import ( - "fmt" - "path" - "path/filepath" - "strings" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - corev1 "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/util/intstr" - jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" - - metrics "github.com/converged-computing/metrics-operator/pkg/metrics" -) - -// These are common templates for standalone apps. 
-// They define the interface of a Metric. - -// These are used for network and job names, etc. -var ( - defaultLauncherLetter = "l" - defaultWorkerLetter = "w" -) - -// LauncherWorker is a launcher + worker setup for apps. These need to -// be accessible by other packages (and not conflict with function names) -type LauncherWorker struct { - Identifier string - Summary string - Container string - Workdir string - ResourceSpec *api.ContainerResources - AttributeSpec *api.ContainerSpec - - // If we ask for sole tenancy, we assign 1 pod / hostname - SoleTenancy bool - - // Scripts - WorkerScript string - LauncherScript string - LauncherLetter string - WorkerLetter string -} - -func (m LauncherWorker) HasSoleTenancy() bool { - return m.SoleTenancy -} - -// Name returns the metric name -func (m LauncherWorker) Name() string { - return m.Identifier -} - -// GetVolumes (if necessary) this is likely only for application metric types -func (m LauncherWorker) GetVolumes() map[string]api.Volume { - return map[string]api.Volume{} -} - -// Description returns the metric description -func (m LauncherWorker) Description() string { - return m.Summary -} - -// Family returns a generic performance family -func (m LauncherWorker) Family() string { - return metrics.PerformanceFamily -} - -// Jobs required for success condition (n is the LauncherWorker run) -func (m *LauncherWorker) SuccessJobs() []string { - m.ensureDefaultNames() - return []string{m.LauncherLetter} -} - -// Container variables -func (n LauncherWorker) Type() string { - return metrics.StandaloneMetric -} -func (n LauncherWorker) Image() string { - return n.Container -} -func (m LauncherWorker) WorkingDir() string { - return m.Workdir -} - -// Return container resources for the metric container -func (m LauncherWorker) Resources() *api.ContainerResources { - return m.ResourceSpec -} -func (m LauncherWorker) Attributes() *api.ContainerSpec { - return m.AttributeSpec -} - -func (m LauncherWorker) getMetricsKeyToPath() 
[]corev1.KeyToPath { - // Runner start scripts - makeExecutable := int32(0777) - - // Each metric has an entrypoint script - return []corev1.KeyToPath{ - { - Key: deriveScriptKey(m.LauncherScript), - Path: path.Base(m.LauncherScript), - Mode: &makeExecutable, - }, - { - Key: deriveScriptKey(m.WorkerScript), - Path: path.Base(m.WorkerScript), - Mode: &makeExecutable, - }, - } -} - -// Ensure the worker and launcher default names are set -func (m *LauncherWorker) ensureDefaultNames() { - // Ensure we set the default launcher letter, if not set - if m.LauncherLetter == "" { - m.LauncherLetter = defaultLauncherLetter - } - if m.WorkerLetter == "" { - m.WorkerLetter = defaultWorkerLetter - } - if m.LauncherScript == "" { - m.LauncherScript = "/metrics_operator/launcher.sh" - } - if m.WorkerScript == "" { - m.WorkerScript = "/metrics_operator/worker.sh" - } -} - -// GetCommonPrefix returns a common prefix for the worker/ launcher script, setting up hosts, etc. -func (m LauncherWorker) GetCommonPrefix( - metadata string, - command string, - hosts string, -) string { - - // Generate problem.sh with command only if we have one! - if command != "" { - command = fmt.Sprintf(`# Write the command file -cat < ./problem.sh -#!/bin/bash -%s -EOF -chmod +x ./problem.sh`, command) - } - - prefixTemplate := `#!/bin/bash -# Start ssh daemon -/usr/sbin/sshd -D & -echo "%s" -# Change directory to where we will run (and write hostfile) -cd %s -# Write the hosts file -cat < ./hostlist.txt -%s -EOF - -%s - -# Allow network to ready (this could be a variable) -echo "Sleeping for 10 seconds waiting for network..." 
-sleep 10 -echo "%s" -` - return fmt.Sprintf( - prefixTemplate, - metadata, - m.WorkingDir(), - hosts, - command, - metrics.CollectionStart, - ) -} - -// AddWorkers generates worker jobs, only if we have them -func (m *LauncherWorker) AddWorkers( - spec *api.MetricSet, - v map[string]api.Volume, - volumes []corev1.Volume, -) (*jobset.ReplicatedJob, error) { - - numWorkers := spec.Spec.Pods - 1 - workers, err := metrics.GetReplicatedJob(spec, false, numWorkers, numWorkers, m.WorkerLetter, m.SoleTenancy) - if err != nil { - return workers, err - } - workers.Template.Spec.Template.Spec.Volumes = volumes - - // ContainerSpec for workers - workerSpec := []metrics.ContainerSpec{ - { - Image: m.Container, - Name: "workers", - Command: []string{"/bin/bash", m.WorkerScript}, - Resources: m.ResourceSpec, - Attributes: m.AttributeSpec, - }, - } - // Prepare containers for workers to add to replicated job - workerContainers, err := metrics.GetContainers(spec, workerSpec, v, false, false) - if err != nil { - fmt.Printf("issue creating worker containers %s", err) - return workers, err - } - workers.Template.Spec.Template.Spec.Containers = workerContainers - return workers, nil -} - -// Replicated Jobs are custom for this standalone metric -func (m *LauncherWorker) ReplicatedJobs(spec *api.MetricSet) ([]jobset.ReplicatedJob, error) { - - js := []jobset.ReplicatedJob{} - m.ensureDefaultNames() - - // Generate a replicated job for the launcher (LauncherWorker) and workers - launcher, err := metrics.GetReplicatedJob(spec, false, 1, 1, m.LauncherLetter, m.SoleTenancy) - if err != nil { - return js, err - } - - // Add volumes defined under storage. 
- v := map[string]api.Volume{} - if spec.HasStorage() { - v["storage"] = spec.Spec.Storage.Volume - } - - // runnerScripts are custom for a LauncherWorker jobset - runnerScripts := m.getMetricsKeyToPath() - volumes := metrics.GetVolumes(spec, runnerScripts, v) - launcher.Template.Spec.Template.Spec.Volumes = volumes - - // Prepare container specs, one for launcher and one for workers - launcherSpec := []metrics.ContainerSpec{ - { - Image: m.Container, - Name: "launcher", - Command: []string{"/bin/bash", m.LauncherScript}, - Resources: m.ResourceSpec, - Attributes: m.AttributeSpec, - }, - } - - // Derive the containers, one per metric, and this includes mounts for volumes - // false and false is disabling shared process namespace and cap sys_admin - launcherContainers, err := metrics.GetContainers(spec, launcherSpec, v, false, false) - if err != nil { - fmt.Printf("issue creating launcher containers %s", err) - return js, err - } - launcher.Template.Spec.Template.Spec.Containers = launcherContainers - - numWorkers := spec.Spec.Pods - 1 - var workers *jobset.ReplicatedJob - - // Generate the replicated job with just a launcher, or launcher and workers - if numWorkers > 0 { - workers, err = m.AddWorkers(spec, v, volumes) - if err != nil { - return js, err - } - js = []jobset.ReplicatedJob{*launcher, *workers} - } else { - js = []jobset.ReplicatedJob{*launcher} - } - return js, nil -} - -func (m LauncherWorker) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} -} - -// Validate that we can run a network. At least one launcher and worker is required -func (m LauncherWorker) Validate(spec *api.MetricSet) bool { - isValid := spec.Spec.Pods >= 2 - if !isValid { - logger.Errorf("Pods for a Launcher Worker app must be >=2. 
This app is invalid.") - } - return isValid -} - -// Given a full path, derive the key from the script name minus the extension -func deriveScriptKey(path string) string { - - // Basename - path = filepath.Base(path) - - // Remove the extension, and this assumes we don't have double . - return strings.Split(path, ".")[0] -} - -func (m LauncherWorker) FinalizeEntrypoints(launcherTemplate string, workerTemplate string) []metrics.EntrypointScript { - return []metrics.EntrypointScript{ - { - Name: deriveScriptKey(m.LauncherScript), - Path: m.LauncherScript, - Script: launcherTemplate, - }, - { - Name: deriveScriptKey(m.WorkerScript), - Path: m.WorkerScript, - Script: workerTemplate, - }, - } -} - -// Get common hostlist for launcher/worker app -func (m *LauncherWorker) GetHostlist(spec *api.MetricSet) string { - m.ensureDefaultNames() - - // The launcher has a different hostname, n for netmark - hosts := fmt.Sprintf("%s-%s-0-0.%s.%s.svc.cluster.local\n", - spec.Name, m.LauncherLetter, spec.Spec.ServiceName, spec.Namespace, - ) - // Add number of workers - for i := 0; i < int(spec.Spec.Pods-1); i++ { - hosts += fmt.Sprintf("%s-%s-0-%d.%s.%s.svc.cluster.local\n", - spec.Name, m.WorkerLetter, i, spec.Spec.ServiceName, spec.Namespace) - } - return hosts -} diff --git a/pkg/jobs/logs.go b/pkg/jobs/logs.go deleted file mode 100644 index 8198351..0000000 --- a/pkg/jobs/logs.go +++ /dev/null @@ -1,29 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National Security, LLC - (c.f. 
AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package jobs - -import ( - "log" - - "go.uber.org/zap" -) - -// Consistent logging identifiers that should be echoed to have newline after -var ( - handle *zap.Logger - logger *zap.SugaredLogger -) - -func init() { - handle, err := zap.NewProduction() - if err != nil { - log.Fatalf("can't initialize zap logger: %v", err) - } - logger = handle.Sugar() - defer handle.Sync() -} diff --git a/pkg/jobs/storage.go b/pkg/jobs/storage.go deleted file mode 100644 index 674a505..0000000 --- a/pkg/jobs/storage.go +++ /dev/null @@ -1,93 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National Security, LLC - (c.f. AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package jobs - -import ( - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - "k8s.io/apimachinery/pkg/util/intstr" - jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" - - metrics "github.com/converged-computing/metrics-operator/pkg/metrics" -) - -// These are common templates for storage apps. -// They define the interface of a Metric. 
- -type StorageGeneric struct { - Identifier string - Summary string - Container string - Workdir string - - ResourceSpec *api.ContainerResources - AttributeSpec *api.ContainerSpec -} - -// Name returns the metric name -func (m StorageGeneric) Name() string { - return m.Identifier -} - -// Family returns the storage family -func (m StorageGeneric) Family() string { - return metrics.StorageFamily -} - -func (m StorageGeneric) GetVolumes() map[string]api.Volume { - return map[string]api.Volume{} -} - -// Description returns the metric description -func (m StorageGeneric) Description() string { - return m.Summary -} - -// By default assume storage does not have sole tenancy -func (m StorageGeneric) HasSoleTenancy() bool { - return false -} - -// Container -func (m StorageGeneric) Image() string { - return m.Container -} - -// WorkingDir does not matter -func (m StorageGeneric) WorkingDir() string { - return m.Workdir -} - -// Return container resources for the metric container -func (m StorageGeneric) Resources() *api.ContainerResources { - return m.ResourceSpec -} -func (m StorageGeneric) Attributes() *api.ContainerSpec { - return m.AttributeSpec -} - -// Validation -func (m StorageGeneric) Validate(set *api.MetricSet) bool { - return true -} - -func (m StorageGeneric) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} -} - -// Jobs required for success condition (n is the netmark run) -func (m StorageGeneric) SuccessJobs() []string { - return []string{} -} - -func (m StorageGeneric) Type() string { - return metrics.StorageMetric -} - -func (m StorageGeneric) ReplicatedJobs(set *api.MetricSet) ([]jobset.ReplicatedJob, error) { - return []jobset.ReplicatedJob{}, nil -} diff --git a/pkg/metadata/metadata.go b/pkg/metadata/metadata.go new file mode 100644 index 0000000..c4050a2 --- /dev/null +++ b/pkg/metadata/metadata.go @@ -0,0 +1,56 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. 
AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package metadata + +import ( + "go.uber.org/zap" + "k8s.io/apimachinery/pkg/util/intstr" +) + +// Consistent logging identifiers that should be echoed to have newline after +var ( + Separator = "METRICS OPERATOR TIMEPOINT" + CollectionStart = "METRICS OPERATOR COLLECTION START" + CollectionEnd = "METRICS OPERATOR COLLECTION END" + handle *zap.Logger + logger *zap.SugaredLogger +) + +// Metric Export is a flattened structure with minimal required metadata for now +// It would be nice if we could just dump everything. +type MetricExport struct { + + // Global + Pods int32 `json:"pods"` + + // Application + ApplicationImage string `json:"applicationImage,omitempty"` + ApplicationCommand string `json:"applicationCommand,omitempty"` + + // Storage + StorageVolumePath string `json:"storageVolumePath,omitempty"` + StorageVolumeHostPath string `json:"storageVolumeHostPath,omitempty"` + StorageVolumeSecretName string `json:"storageVolumeSecretName,omitempty"` + StorageVolumeClaimName string `json:"storageVolumeClaimName,omitempty"` + StorageVolumeConfigMapName string `json:"storageVolumeConfigMapName,omitempty"` + + // Metric + MetricName string `json:"metricName,omitempty"` + MetricDescription string `json:"metricDescription,omitempty"` + MetricType string `json:"metricType,omitempty"` + MetricOptions map[string]intstr.IntOrString `json:"metricOptions,omitempty"` + MetricListOptions map[string][]intstr.IntOrString `json:"metricListOptions,omitempty"` +} + +// Interactive returns a sleep infinity if interactive is true +func Interactive(interactive bool) string { + if interactive { + return "sleep infinity" + } + return "" +} diff --git a/pkg/metrics/app/amg.go b/pkg/metrics/app/amg.go index eed1154..15dc22e 100644 --- a/pkg/metrics/app/amg.go +++ b/pkg/metrics/app/amg.go @@ -8,23 +8,21 @@ SPDX-License-Identifier: MIT package application import ( - "fmt" - - api 
"github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" ) +const ( + amgIdentifier = "app-amg" + amgSummary = "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids" + amgContainer = "ghcr.io/converged-computing/metric-amg:latest" +) + // AMG is a launcher + workers metric application type AMG struct { - jobs.LauncherWorker - - // Custom Options - workdir string - command string - prefix string + metrics.LauncherWorker } func (m AMG) Url() string { @@ -38,27 +36,17 @@ func (m AMG) Family() string { // Set custom options / attributes for the metric func (m *AMG) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes - // Set user defined values or fall back to defaults - m.prefix = "mpirun --hostfile ./hostlist.txt" - m.command = "amg" - m.workdir = "/opt/AMG" + // TODO change these to class varaibles? then set in two places... 
+ m.Identifier = amgIdentifier + m.Summary = amgSummary + m.Container = amgContainer - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.workdir = workdir.StrVal - } - mpirun, ok := metric.Options["mpirun"] - if ok { - m.prefix = mpirun.StrVal - } + // Set user defined values or fall back to defaults + m.Prefix = "mpirun --hostfile ./hostlist.txt" + m.Command = "amg" + m.Workdir = "/opt/AMG" + m.SetDefaultOptions(metric) } // Validate that we can run AMG @@ -69,53 +57,19 @@ func (n AMG) Validate(spec *api.MetricSet) bool { // Exported options and list options func (m AMG) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "mpirun": intstr.FromString(m.prefix), - "workdir": intstr.FromString(m.workdir), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), + "workdir": intstr.FromString(m.Workdir), } } -// Return lookup of entrypoint scripts -func (m AMG) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) []metrics.EntrypointScript { - - // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) - - // Template for the launcher - template := ` -echo "%s" -%s ./problem.sh -echo "%s" -%s -` - launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.prefix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - - // Return the script templates for each of launcher and worker - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) -} - func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-amg", - Summary: "parallel algebraic multigrid solver for 
linear systems arising from problems on unstructured grids", - Container: "ghcr.io/converged-computing/metric-amg:latest", - WorkerScript: "/metrics_operator/amg-worker.sh", - LauncherScript: "/metrics_operator/amg-launcher.sh", + base := metrics.BaseMetric{ + Identifier: amgIdentifier, + Summary: amgSummary, + Container: amgContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} amg := AMG{LauncherWorker: launcher} metrics.Register(&amg) } diff --git a/pkg/metrics/app/bdas.go b/pkg/metrics/app/bdas.go index 077c13c..d15a9c9 100644 --- a/pkg/metrics/app/bdas.go +++ b/pkg/metrics/app/bdas.go @@ -10,19 +10,22 @@ package application import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) -type BDAS struct { - jobs.LauncherWorker +const ( + bdasIdentifier = "app-bdas" + bdasSummary = "The big data analytic suite contains the K-Means observation label, PCA, and SVM benchmarks." + bdasContainer = "ghcr.io/converged-computing/metric-bdas:latest" +) - // Custom Options - command string - prefix string +type BDAS struct { + metrics.LauncherWorker } // I think this is a simulation? 
@@ -36,99 +39,101 @@ func (m BDAS) Url() string { // Set custom options / attributes for the metric func (m *BDAS) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes + + // Metadata + m.Identifier = bdasIdentifier + m.Summary = bdasSummary + m.Container = bdasContainer // Set user defined values or fall back to defaults - m.prefix = "/bin/bash" - m.command = "mpirun --allow-run-as-root -np 4 --hostfile ./hostlist.txt Rscript /opt/bdas/benchmarks/r/princomp.r 250 50" + m.Prefix = "/bin/bash" + m.Command = "mpirun --allow-run-as-root -np 4 --hostfile ./hostlist.txt Rscript /opt/bdas/benchmarks/r/princomp.r 250 50" m.Workdir = "/opt/bdas/benchmarks/r" // Examples from guide // mpirun -np num_ranks Rscript princomp.r num_local_rows num_global_cols // mpirun -np 16 Rscript princomp.r 1000 250 - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } - prefix, ok := metric.Options["prefix"] - if ok { - m.prefix = prefix.StrVal - } + m.SetDefaultOptions(metric) } // Exported options and list options func (m BDAS) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "prefix": intstr.FromString(m.prefix), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), "workdir": intstr.FromString(m.Workdir), } } -// Return lookup of entrypoint scripts -func (m BDAS) EntrypointScripts( +func (m BDAS) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) + prefix := m.GetCommonPrefix(meta, m.Command, hosts) 
- // Template for the launcher - // TODO need to finish adding here when BDAS rebuild done - template := ` + preBlock := ` echo "%s" # We need ip addresses for openmpi mv ./hostlist.txt ./hostnames.txt for h in $(cat ./hostnames.txt); do if [[ "${h}" != "" ]]; then - if [[ "${h}" == "$(hostname)" ]]; then - hostname -I | awk '{print $1}' >> hostlist.txt - else - host $h | cut -d ' ' -f 4 >> hostlist.txt - fi + if [[ "${h}" == "$(hostname)" ]]; then + hostname -I | awk '{print $1}' >> hostlist.txt + else + host $h | cut -d ' ' -f 4 >> hostlist.txt + fi fi done echo "Hostlist" cat ./hostlist.txt -echo "%s" -echo "%s" -%s ./problem.sh +` + + postBlock := ` echo "%s" %s ` - launcherTemplate := prefix + fmt.Sprintf( - template, - metadata, - metrics.CollectionStart, - metrics.Separator, - m.prefix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) + command := fmt.Sprintf("%s ./problem.sh", m.Prefix) + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = prefix + fmt.Sprintf(preBlock, metadata.Separator) + postBlock = fmt.Sprintf(postBlock, metadata.CollectionEnd, interactive) + + // Entrypoint for the launcher + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: m.LauncherScript, + Pre: preBlock, + Command: command, + Post: postBlock, + } + + // Entrypoint for the worker + workerEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } + + // Container spec for the launcher + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) + + // Return the script templates for each of launcher and worker + return 
[]*specs.ContainerSpec{&launcherContainer, &workerContainer} } func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-bdas", - Summary: "The big data analytic suite contains the K-Means observation label, PCA, and SVM benchmarks.", - Container: "ghcr.io/converged-computing/metric-bdas:latest", + base := metrics.BaseMetric{ + Identifier: bdasIdentifier, + Summary: bdasSummary, + Container: bdasContainer, } - + launcher := metrics.LauncherWorker{BaseMetric: base} BDAS := BDAS{LauncherWorker: launcher} metrics.Register(&BDAS) } diff --git a/pkg/metrics/app/hpl.go b/pkg/metrics/app/hpl.go index 9e06ad3..ab9ce75 100644 --- a/pkg/metrics/app/hpl.go +++ b/pkg/metrics/app/hpl.go @@ -10,16 +10,23 @@ package application import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) // https://www.netlib.org/benchmark/hpl/ // https://ulhpc-tutorials.readthedocs.io/en/production/parallel/mpi/HPL/ +const ( + hplIdentifier = "app-hpl" + hplSummary = "High-Performance Linpack (HPL)" + hplContainer = "ghcr.io/converged-computing/metric-hpl-spack:latest" +) + // Default input file hpl.dat // The output of this is Ns, memory is in GiB // -m 128 -NB 192 -r 0.3 -N 2: translates to --mem 128 -NB ${blocksize} -r 0.3 -N ${pods} @@ -59,7 +66,7 @@ ${mem_alignment} memory alignment in double (> 0) (4,8,16) ) type HPL struct { - jobs.LauncherWorker + metrics.LauncherWorker // Custom Options mpiargs string @@ -121,6 +128,10 @@ func (m *HPL) SetOptions(metric *api.Metric) { m.ResourceSpec = &metric.Resources m.AttributeSpec = &metric.Attributes + m.Identifier = hplIdentifier + m.Summary = hplSummary 
+ m.Container = hplContainer + // Defaults for hpl.dat values. // memory and pods (nodes) calculated on the fly, unless otherwise provided m.ratio = "0.3" @@ -183,6 +194,10 @@ func (m *HPL) SetOptions(metric *api.Metric) { if ok { m.blocksize = value.IntVal } + value, ok = metric.Options["workdir"] + if ok { + m.Workdir = value.StrVal + } value, ok = metric.Options["row_or_colmajor_pmapping"] if ok { m.row_or_colmajor_pmapping = value.IntVal @@ -256,34 +271,31 @@ func (m HPL) Options() map[string]intstr.IntOrString { } } -// Return lookup of entrypoint scripts -func (m HPL) EntrypointScripts( +func (m HPL) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, "", hosts) + prefix := m.GetCommonPrefix(meta, "", hosts) // Memory command since could mess up templating memoryCmd := `awk '/MemFree/ { printf "%.3f \n", $2/1024/1024 }' /proc/meminfo` - // Template for the launcher - // TODO need to finish adding here when HPL rebuild done - template := ` + preBlock := ` # Source spack environment . 
/opt/spack-environment/activate.sh - + # Calculate memory, if not defined memory=%d if [[ $memory -eq 0 ]]; then memory=$(%s) fi - + echo "Memory is ${memory}" - + np=%d pods=%d # Tasks per node, not total @@ -291,20 +303,20 @@ tasks=$(nproc) if [[ $np -eq 0 ]]; then np=$(( $pods*$tasks )) fi - + echo "Number of tasks (nproc on one node) is $tasks" echo "Number of tasks total (across $pods nodes) is $np" - + blocksize=%d ratio=%s - + # This calculates the compute value - retrieved from tutorials in /opt/view/bin compute_script="compute_N -m ${memory} -NB ${blocksize} -r ${ratio} -N ${pods}" echo $compute_script # This is the size, variable "N" in the hpl.dat (not confusing or anything) size=$(${compute_script}) echo "Compute size is ${size}" - + # Define rest of envars we need for template row_or_colmajor_pmapping=%d pfact=%d @@ -318,24 +330,28 @@ swapping_threshold=%d L1_transposed=%d U_transposed=%d mem_alignment=%d - + # Write the input file (this parses environment variables too) cat < ./hpl.dat %s EOF - + cp ./hostlist.txt ./hostnames.txt rm ./hostlist.txt %s - + echo "%s" # This is in /root/hpl/bin/linux/xhpl -mpirun --allow-run-as-root --hostfile ./hostlist.txt -np $np %s xhpl +` + + postBlock := ` echo "%s" %s ` - launcherTemplate := prefix + fmt.Sprintf( - template, + command := fmt.Sprintf("mpirun --allow-run-as-root --hostfile ./hostlist.txt -np $np %s xhpl", m.mpiargs) + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = prefix + fmt.Sprintf( + preBlock, m.memory, memoryCmd, m.tasks, @@ -356,24 +372,43 @@ echo "%s" m.memAlignment, inputData, metrics.TemplateConvertHostnames, - metrics.Separator, - m.mpiargs, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), + metadata.Separator, ) + postBlock = fmt.Sprintf(postBlock, metadata.CollectionEnd, interactive) + + // Entrypoint for the launcher + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: 
m.LauncherScript, + Pre: preBlock, + Command: command, + Post: postBlock, + } + + // Entrypoint for the worker + workerEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } + + // Container spec for the launcher + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) + + // Return the script templates for each of launcher and worker + return []*specs.ContainerSpec{&launcherContainer, &workerContainer} - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) } func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-hpl", - Summary: "High-Performance Linpack (HPL)", - Container: "ghcr.io/converged-computing/metric-hpl-spack:latest", + base := metrics.BaseMetric{ + Identifier: hplIdentifier, + Summary: hplSummary, + Container: hplContainer, } - + launcher := metrics.LauncherWorker{BaseMetric: base} HPL := HPL{LauncherWorker: launcher} metrics.Register(&HPL) } diff --git a/pkg/metrics/app/kripke.go b/pkg/metrics/app/kripke.go index c9197a4..4906b6d 100644 --- a/pkg/metrics/app/kripke.go +++ b/pkg/metrics/app/kripke.go @@ -8,21 +8,20 @@ SPDX-License-Identifier: MIT package application import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" ) -type Kripke struct { - jobs.LauncherWorker +const ( + kripkeIdentifier = "app-kripke" + kripkeSummary = "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids" + kripkeContainer = "ghcr.io/converged-computing/metric-kripke:latest" +) 
- // Options - command string - prefix string +type Kripke struct { + metrics.LauncherWorker } func (m Kripke) Url() string { @@ -36,27 +35,16 @@ func (m Kripke) Family() string { // Set custom options / attributes for the metric func (m *Kripke) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes + + m.Identifier = kripkeIdentifier + m.Summary = kripkeSummary + m.Container = kripkeContainer // Set user defined values or fall back to defaults - m.prefix = "mpirun --hostfile ./hostlist.txt" - m.command = "kripke" + m.Prefix = "mpirun --hostfile ./hostlist.txt" + m.Command = "kripke" m.Workdir = "/opt/kripke" - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } - mpirun, ok := metric.Options["mpirun"] - if ok { - m.prefix = mpirun.StrVal - } + m.SetDefaultOptions(metric) } // Validate that we can run Kripke @@ -67,8 +55,8 @@ func (n Kripke) Validate(spec *api.MetricSet) bool { // Exported options and list options func (m Kripke) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "mpirun": intstr.FromString(m.prefix), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), "workdir": intstr.FromString(m.Workdir), } } @@ -76,45 +64,13 @@ func (n Kripke) ListOptions() map[string][]intstr.IntOrString { return map[string][]intstr.IntOrString{} } -// Return lookup of entrypoint scripts -func (m Kripke) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) []metrics.EntrypointScript { - - // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) - - // Template for the launcher - template := ` -echo "%s" -%s ./problem.sh -echo "%s" -%s -` 
- launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.prefix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) -} - func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-kripke", - Summary: "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids", - Container: "ghcr.io/converged-computing/metric-kripke:latest", - WorkerScript: "/metrics_operator/kripke-worker.sh", - LauncherScript: "/metrics_operator/kripke-launcher.sh", + base := metrics.BaseMetric{ + Identifier: kripkeIdentifier, + Summary: kripkeSummary, + Container: kripkeContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} kripke := Kripke{LauncherWorker: launcher} metrics.Register(&kripke) } diff --git a/pkg/metrics/app/laghos.go b/pkg/metrics/app/laghos.go index 87af2c2..e1562d7 100644 --- a/pkg/metrics/app/laghos.go +++ b/pkg/metrics/app/laghos.go @@ -8,21 +8,20 @@ SPDX-License-Identifier: MIT package application import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" ) -type Laghos struct { - jobs.LauncherWorker +const ( + laghosIdentifier = "app-laghos" + laghosSummary = "LAGrangian High-Order Solver" + laghosContainer = "ghcr.io/converged-computing/metric-laghos:latest" +) - // Custom Options - command string - prefix string +type Laghos struct { + metrics.LauncherWorker } // I think this is a simulation? 
@@ -36,77 +35,34 @@ func (m Laghos) Url() string { // Set custom options / attributes for the metric func (m *Laghos) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes + + m.Identifier = laghosIdentifier + m.Summary = laghosSummary + m.Container = laghosContainer // Set user defined values or fall back to defaults - m.prefix = "/bin/bash" - m.command = "mpirun -np 4 --hostfile ./hostlist.txt ./laghos" + m.Prefix = "/bin/bash" + m.Command = "mpirun -np 4 --hostfile ./hostlist.txt ./laghos" m.Workdir = "/workflow/laghos" - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } - prefix, ok := metric.Options["prefix"] - if ok { - m.prefix = prefix.StrVal - } + m.SetDefaultOptions(metric) } // Exported options and list options func (m Laghos) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "prefix": intstr.FromString(m.prefix), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), "workdir": intstr.FromString(m.Workdir), } } -// Return lookup of entrypoint scripts -func (m Laghos) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) []metrics.EntrypointScript { - - // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) - - // Template for the launcher - // TODO need to finish adding here when laghos rebuild done - template := ` -echo "%s" -%s ./problem.sh -echo "%s" -%s -` - launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.prefix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep 
infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) -} - func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-laghos", - Summary: "LAGrangian High-Order Solver", - Container: "ghcr.io/converged-computing/metric-laghos:latest", + base := metrics.BaseMetric{ + Identifier: laghosIdentifier, + Summary: laghosSummary, + Container: laghosContainer, } - + launcher := metrics.LauncherWorker{BaseMetric: base} Laghos := Laghos{LauncherWorker: launcher} metrics.Register(&Laghos) } diff --git a/pkg/metrics/app/lammps.go b/pkg/metrics/app/lammps.go index 5ac69b3..2c7849e 100644 --- a/pkg/metrics/app/lammps.go +++ b/pkg/metrics/app/lammps.go @@ -10,18 +10,22 @@ package application import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) -type Lammps struct { - jobs.LauncherWorker +const ( + lammpsIdentifier = "app-lammps" + lammpsSummary = "LAMMPS molecular dynamic simulation" + lammpsContainer = "ghcr.io/converged-computing/metric-lammps:latest" +) - // Options - command string +type Lammps struct { + metrics.LauncherWorker } func (m Lammps) Url() string { @@ -35,83 +39,96 @@ func (m Lammps) Family() string { // Set custom options / attributes for the metric func (m *Lammps) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes + + // Default metric options, these are overridden when we reflect + m.Identifier = lammpsIdentifier + m.Summary = lammpsSummary + m.Container = lammpsContainer // Set user defined values or fall back to defaults // This is a more manual approach that puts the user in charge 
of determining the entire command // This more closely matches what we might do on HPC :) - m.command = "mpirun --hostfile ./hostlist.txt -np 2 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite" + m.Command = "mpirun --hostfile ./hostlist.txt -np 2 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite" m.Workdir = "/opt/lammps/examples/reaxff/HNS" - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } + m.SetDefaultOptions(metric) } -// Validate that we can run Lammps -func (n Lammps) Validate(spec *api.MetricSet) bool { - return spec.Spec.Pods >= 2 +// LAMMPS can be run on one node +func (m Lammps) Validate(spec *api.MetricSet) bool { + return true } // Exported options and list options func (m Lammps) Options() map[string]intstr.IntOrString { - return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "workdir": intstr.FromString(m.Workdir), + values := map[string]intstr.IntOrString{ + "command": intstr.FromString(m.Command), + "workdir": intstr.FromString(m.Workdir), + "soleTenancy": intstr.FromString("false"), } -} -func (n Lammps) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} + if m.SoleTenancy { + values["soleTenancy"] = intstr.FromString("true") + } + return values } -// Return lookup of entrypoint scripts -func (m Lammps) EntrypointScripts( +// Prepare containers with jobs and entrypoint scripts +func (m Lammps) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) + prefix := m.GetCommonPrefix(meta, m.Command, hosts) - // Template for 
the launcher - template := ` + // Template blocks for launcher script + preBlock := ` echo "%s" -%s +` + + postBlock := ` echo "%s" %s ` - launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.command, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = prefix + fmt.Sprintf(preBlock, metadata.Separator) + postBlock = fmt.Sprintf(postBlock, metadata.CollectionEnd, interactive) + + // Entrypoint for the launcher + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: m.LauncherScript, + Pre: preBlock, + Command: m.Command, + Post: postBlock, + } + + // Entrypoint for the worker + // Just has a sleep infinity added to the prefix + workerEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" + // These are associated with replicated jobs via JobName + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) // Return the script templates for each of launcher and worker - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) + return []*specs.ContainerSpec{&launcherContainer, &workerContainer} } +// TODO can we have a new function instead? 
func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-lammps", - Summary: "LAMMPS molecular dynamic simulation", - Container: "ghcr.io/converged-computing/metric-lammps:latest", - WorkerScript: "/metrics_operator/lammps-worker.sh", - LauncherScript: "/metrics_operator/lammps-launcher.sh", + base := metrics.BaseMetric{ + Identifier: lammpsIdentifier, + Summary: lammpsSummary, + Container: lammpsContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} lammps := Lammps{LauncherWorker: launcher} metrics.Register(&lammps) } diff --git a/pkg/metrics/app/ldms.go b/pkg/metrics/app/ldms.go index d08d2e6..f6534d2 100644 --- a/pkg/metrics/app/ldms.go +++ b/pkg/metrics/app/ldms.go @@ -10,15 +10,22 @@ package application import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" +) + +const ( + ldmsIdentifier = "app-ldms" + ldmsSummary = "provides LDMS, a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system." 
+ ldmsContainer = "ghcr.io/converged-computing/metric-ovis-hpc:latest" ) type LDMS struct { - jobs.SingleApplication + metrics.SingleApplication // Custom Options completions int32 @@ -39,6 +46,10 @@ func (m LDMS) Url() string { func (m *LDMS) SetOptions(metric *api.Metric) { m.ResourceSpec = &metric.Resources m.AttributeSpec = &metric.Attributes + + m.Identifier = ldmsIdentifier + m.Container = ldmsContainer + m.Summary = ldmsSummary m.rate = 10 // Set user defined values or fall back to defaults @@ -61,6 +72,8 @@ func (m *LDMS) SetOptions(metric *api.Metric) { if ok { m.rate = rate.IntVal } + // Primarily sole tenancy + m.SetDefaultOptions(metric) } // Exported options and list options @@ -76,17 +89,15 @@ func (n LDMS) ListOptions() map[string][]intstr.IntOrString { return map[string][]intstr.IntOrString{} } -// Return lookup of entrypoint scripts -func (m LDMS) EntrypointScripts( +func (m LDMS) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) - // Template for the launcher - template := ` + preBlock := ` # Setup munge mkdir -p /run/munge chown -R 0 /var/log/munge /var/lib/munge /etc/munge /run/munge @@ -94,52 +105,54 @@ chown -R 0 /var/log/munge /var/lib/munge /etc/munge /run/munge # ldmsd -x sock:10444 -c /opt/sampler.conf -l /tmp/demo_ldmsd_log -v DEBUG -a munge -r $(pwd)/ldmsd.pid ldmsd -x sock:10444 -c /opt/sampler.conf -l /tmp/demo_ldmsd_log -v DEBUG -r $(pwd)/ldmsd.pid echo "%s" - + i=0 completions=%d echo "%s" while true do - echo "%s" - %s - if [[ $retval -ne 0 ]]; then - echo "%s" - exit 0 - fi - if [[ $completions -ne 0 ]] && [[ $i -eq $completions ]]; then - echo "%s" - exit 0 - fi - sleep %d - let i=i+1 + echo "%s" + %s + if [[ $retval -ne 0 ]]; then + echo "%s" + exit 0 + fi + if [[ $completions -ne 0 ]] && [[ $i -eq $completions ]]; then + echo "%s" + 
exit 0 + fi + sleep %d + let i=i+1 done +` + + postBlock := ` echo "%s" %s ` - script := fmt.Sprintf( - template, - metadata, + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = fmt.Sprintf( + preBlock, + meta, m.completions, - metrics.CollectionStart, - metrics.Separator, + metadata.CollectionStart, + metadata.Separator, m.command, - metrics.CollectionEnd, - metrics.CollectionEnd, + metadata.CollectionEnd, + metadata.CollectionEnd, m.rate, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), ) - return []metrics.EntrypointScript{ - {Script: script}, - } + postBlock = fmt.Sprintf(postBlock, metadata.CollectionEnd, interactive) + return m.ApplicationContainerSpec(preBlock, "", postBlock) } func init() { - app := jobs.SingleApplication{ - Identifier: "app-ldms", - Summary: "provides LDMS, a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system.", - Container: "ghcr.io/converged-computing/metric-ovis-hpc:latest", + base := metrics.BaseMetric{ + Identifier: ldmsIdentifier, + Summary: ldmsSummary, + Container: ldmsContainer, } - LDMS := LDMS{SingleApplication: app} + single := metrics.SingleApplication{BaseMetric: base} + LDMS := LDMS{SingleApplication: single} metrics.Register(&LDMS) } diff --git a/pkg/metrics/app/nekbone.go b/pkg/metrics/app/nekbone.go index f4cfef3..21e7d34 100644 --- a/pkg/metrics/app/nekbone.go +++ b/pkg/metrics/app/nekbone.go @@ -8,21 +8,20 @@ SPDX-License-Identifier: MIT package application import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" ) -type Nekbone struct { - jobs.LauncherWorker +const ( + nekboneIdentifier = "app-nekbone" + nekboneSummary 
= "A mini-app derived from the Nek5000 CFD code which is a high order, incompressible Navier-Stokes CFD solver based on the spectral element method. The conjugate gradiant solve is compute intense, contains small messages and frequent allreduces." + nekboneContainer = "ghcr.io/converged-computing/metric-nekbone:latest" +) - // Custom Options - command string - prefix string +type Nekbone struct { + metrics.LauncherWorker } // I think this is a simulation? @@ -36,78 +35,32 @@ func (m Nekbone) Url() string { // Set custom options / attributes for the metric func (m *Nekbone) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes - // Set user defined values or fall back to defaults - m.prefix = "/bin/bash" - m.command = "mpiexec --hostfile ./hostlist.txt -np 2 ./nekbone" + m.Identifier = nekboneIdentifier + m.Summary = nekboneSummary + m.Container = nekboneContainer + m.Prefix = "/bin/bash" + m.Command = "mpiexec --hostfile ./hostlist.txt -np 2 ./nekbone" m.Workdir = "/root/nekbone-3.0/test/example2" - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } - prefix, ok := metric.Options["prefix"] - if ok { - m.prefix = prefix.StrVal - } + m.SetDefaultOptions(metric) } // Exported options and list options func (m Nekbone) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "prefix": intstr.FromString(m.prefix), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), "workdir": intstr.FromString(m.Workdir), } } -func (n Nekbone) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} -} - -// Return lookup of entrypoint scripts -func (m Nekbone) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) 
[]metrics.EntrypointScript { - - // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) - - // Template for the launcher - template := ` -echo "%s" -%s ./problem.sh -echo "%s" -%s -` - launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.prefix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) -} func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-nekbone", - Summary: "A mini-app derived from the Nek5000 CFD code which is a high order, incompressible Navier-Stokes CFD solver based on the spectral element method. The conjugate gradiant solve is compute intense, contains small messages and frequent allreduces.", - Container: "ghcr.io/converged-computing/metric-nekbone:latest", + base := metrics.BaseMetric{ + Identifier: nekboneIdentifier, + Summary: nekboneSummary, + Container: nekboneContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} Nekbone := Nekbone{LauncherWorker: launcher} metrics.Register(&Nekbone) } diff --git a/pkg/metrics/app/pennant.go b/pkg/metrics/app/pennant.go index 5ba52e8..9f716ba 100644 --- a/pkg/metrics/app/pennant.go +++ b/pkg/metrics/app/pennant.go @@ -8,21 +8,20 @@ SPDX-License-Identifier: MIT package application import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" ) -type Pennant struct { - jobs.LauncherWorker +const ( + pennantIdentifier = "app-pennant" + pennantSummary = "Unstructured mesh hydrodynamics for 
advanced architectures " + pennantContainer = "ghcr.io/converged-computing/metric-pennant:latest" +) - // Custom Options - command string - prefix string +type Pennant struct { + metrics.LauncherWorker } // I think this is a simulation? @@ -36,80 +35,34 @@ func (m Pennant) Url() string { // Set custom options / attributes for the metric func (m *Pennant) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes + + m.Container = pennantContainer + m.Identifier = pennantIdentifier + m.Summary = pennantSummary // Set user defined values or fall back to defaults - m.prefix = "mpirun --hostfile ./hostlist.txt" - m.command = "pennant /opt/pennant/test/sedovsmall/sedovsmall.pnt" + m.Prefix = "mpirun --hostfile ./hostlist.txt" + m.Command = "pennant /opt/pennant/test/sedovsmall/sedovsmall.pnt" m.Workdir = "/opt/pennant/test" - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } - mpirun, ok := metric.Options["mpirun"] - if ok { - m.prefix = mpirun.StrVal - } + m.SetDefaultOptions(metric) } // Exported options and list options func (m Pennant) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "mpirun": intstr.FromString(m.prefix), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), "workdir": intstr.FromString(m.Workdir), } } -func (n Pennant) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} -} - -// Return lookup of entrypoint scripts -func (m Pennant) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) []metrics.EntrypointScript { - - // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) - - 
// Template for the launcher - template := ` -echo "%s" -%s ./problem.sh -echo "%s" -%s -` - launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.prefix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) -} func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-pennant", - Summary: "Unstructured mesh hydrodynamics for advanced architectures ", - Container: "ghcr.io/converged-computing/metric-pennant:latest", - WorkerScript: "/metrics_operator/pennant-worker.sh", - LauncherScript: "/metrics_operator/pennant-launcher.sh", + base := metrics.BaseMetric{ + Identifier: pennantIdentifier, + Summary: pennantSummary, + Container: pennantContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} Pennant := Pennant{LauncherWorker: launcher} metrics.Register(&Pennant) } diff --git a/pkg/metrics/app/quicksilver.go b/pkg/metrics/app/quicksilver.go index 59195ba..56477eb 100644 --- a/pkg/metrics/app/quicksilver.go +++ b/pkg/metrics/app/quicksilver.go @@ -8,21 +8,20 @@ SPDX-License-Identifier: MIT package application import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" ) -type Quicksilver struct { - jobs.LauncherWorker +const ( + qsIdentifier = "app-quicksilver" + qsSummary = "A proxy app for the Monte Carlo Transport Code" + qsContainer = "ghcr.io/converged-computing/metric-quicksilver:latest" +) - // Custom Options - command string - mpirun string +type Quicksilver struct { + metrics.LauncherWorker } // I think this is a simulation? 
@@ -36,80 +35,34 @@ func (m Quicksilver) Url() string { // Set custom options / attributes for the metric func (m *Quicksilver) SetOptions(metric *api.Metric) { - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes + + m.Identifier = qsIdentifier + m.Summary = qsSummary + m.Container = qsContainer // Set user defined values or fall back to defaults - m.mpirun = "mpirun --hostfile ./hostlist.txt" - m.command = "qs /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp" + m.Prefix = "mpirun --hostfile ./hostlist.txt" + m.Command = "qs /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp" m.Workdir = "/opt/quicksilver/Examples" - - // This could be improved :) - command, ok := metric.Options["command"] - if ok { - m.command = command.StrVal - } - workdir, ok := metric.Options["workdir"] - if ok { - m.Workdir = workdir.StrVal - } - mpirun, ok := metric.Options["mpirun"] - if ok { - m.mpirun = mpirun.StrVal - } + m.SetDefaultOptions(metric) } // Exported options and list options func (m Quicksilver) Options() map[string]intstr.IntOrString { return map[string]intstr.IntOrString{ - "command": intstr.FromString(m.command), - "mpirun": intstr.FromString(m.mpirun), + "command": intstr.FromString(m.Command), + "prefix": intstr.FromString(m.Prefix), "workdir": intstr.FromString(m.Workdir), } } -func (n Quicksilver) ListOptions() map[string][]intstr.IntOrString { - return map[string][]intstr.IntOrString{} -} - -// Return lookup of entrypoint scripts -func (m Quicksilver) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) []metrics.EntrypointScript { - - // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) - prefix := m.GetCommonPrefix(metadata, m.command, hosts) - - // Template for the launcher - template := ` -echo "%s" -%s ./problem.sh -echo "%s" -%s -` - launcherTemplate := prefix + fmt.Sprintf( - template, - metrics.Separator, - m.mpirun, - 
metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) -} func init() { - launcher := jobs.LauncherWorker{ - Identifier: "app-quicksilver", - Summary: "A proxy app for the Monte Carlo Transport Code", - Container: "ghcr.io/converged-computing/metric-quicksilver:latest", - WorkerScript: "/metrics_operator/quicksilver-worker.sh", - LauncherScript: "/metrics_operator/quicksilver-launcher.sh", + base := metrics.BaseMetric{ + Identifier: qsIdentifier, + Summary: qsSummary, + Container: qsContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} Quicksilver := Quicksilver{LauncherWorker: launcher} metrics.Register(&Quicksilver) } diff --git a/pkg/metrics/application.go b/pkg/metrics/application.go index d0a4853..1b9997a 100644 --- a/pkg/metrics/application.go +++ b/pkg/metrics/application.go @@ -8,113 +8,68 @@ SPDX-License-Identifier: MIT package metrics import ( - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - corev1 "k8s.io/api/core/v1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" ) +// These are common templates for application metrics var ( - DefaultApplicationEntrypoint = "/metrics_operator/application-0.sh" - DefaultApplicationName = "application-0" - makeExecutable = int32(0777) + DefaultEntrypointScript = "/metrics_operator/entrypoint-0.sh" ) -// Get ReplicatedJobs intended to run a performance metric for an application -// For this setup, we expect to create a container for each metric -func (m *ApplicationMetricSet) ReplicatedJobs(spec *api.MetricSet) ([]jobset.ReplicatedJob, error) { - rjs := []jobset.ReplicatedJob{} - - // If no application volumes defined, need to init here - appVols := 
spec.Spec.Application.Volumes - if len(appVols) == 0 { - appVols = map[string]api.Volume{} - } - for _, metric := range m.Metrics() { - jobs, err := GetApplicationReplicatedJobs(spec, metric, appVols, true) - if err != nil { - return rjs, err - } - rjs = append(rjs, jobs...) - } - return rjs, nil +// SingleApplication is a Metric base for a simple application metric +// be accessible by other packages (and not conflict with function names) +type SingleApplication struct { + BaseMetric } -// Create a standalone JobSet, one without volumes or application -// This will be definition be a JobSet for only one metric -func GetApplicationReplicatedJobs( - spec *api.MetricSet, - metric *Metric, - volumes map[string]api.Volume, - shareProcessNamespace bool, -) ([]jobset.ReplicatedJob, error) { - - // Prepare a replicated job - rjs := []jobset.ReplicatedJob{} - - // We currently don't expose applications to allow custom replicated jobs - // If we return no replicated jobs, fall back to default - m := (*metric) - - // This defaults to one replicated job, named "m", no custom replicated job name, and sole tenancy false - job, err := GetReplicatedJob(spec, shareProcessNamespace, spec.Spec.Pods, spec.Spec.Completions, "", m.HasSoleTenancy()) - if err != nil { - return rjs, err - } - - // Add metric volumes to the list! This is usually for sharing metric assets with the application - // as an empty volume. Note that we do not check for overlapping keys - up to user. - // It is the responsibility of the metric to determine the mount location and entrypoint additions - metricVolumes := m.GetVolumes() - for k, v := range metricVolumes { - volumes[k] = v - } +func (m SingleApplication) HasSoleTenancy() bool { + return false +} - // Add volumes expecting an application. GetVolumes creates metric entrypoint volumes - // and adds existing volumes (application) to our set of mounts. We need both - // for the jobset. 
- runnerScripts := GetMetricsKeyToPath([]*Metric{metric}) +// Default SingleApplication is generic performance family +func (m SingleApplication) Family() string { + return PerformanceFamily +} - // Add the application entrypoint - appScript := corev1.KeyToPath{ - Key: DefaultApplicationName, - Path: DefaultApplicationName + ".sh", - Mode: &makeExecutable, +func (m *SingleApplication) ApplicationContainerSpec( + preBlock string, + command string, + postBlock string, +) []*specs.ContainerSpec { + + entrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(DefaultEntrypointScript), + Path: DefaultEntrypointScript, + Pre: preBlock, + Command: command, + Post: postBlock, } - runnerScripts = append(runnerScripts, appScript) - // Each metric has an entrypoint script - job.Template.Spec.Template.Spec.Volumes = GetVolumes(spec, runnerScripts, volumes) + return []*specs.ContainerSpec{{ + JobName: ReplicatedJobName, + Image: m.Image(), + Name: "app", + WorkingDir: m.Workdir, + EntrypointScript: entrypoint, + Resources: m.ResourceSpec, + Attributes: m.AttributeSpec, + }} - // Derive the containers for the metric - containerSpec := ContainerSpec{ - Image: m.Image(), - Command: []string{"/bin/bash", "/metrics_operator/entrypoint-0.sh"}, - WorkingDir: m.WorkingDir(), - Name: m.Name(), - Resources: m.Resources(), - Attributes: m.Attributes(), - } - - // This is for the metric and application containers - // Metric containers have metric entrypoint volumes - // Application containers have existing volumes - containers, err := GetContainers( - spec, - []ContainerSpec{containerSpec}, - volumes, +} - // Allow ptrace - true, +// Replicated Jobs are custom for a launcher worker +func (m *SingleApplication) ReplicatedJobs(spec *api.MetricSet) ([]*jobset.ReplicatedJob, error) { - // Allow sysadmin - true, - ) + js := []*jobset.ReplicatedJob{} + // Generate a replicated job for the application + // An empty jobname will default to "m" the ReplicatedJobName provided by the
operator + rj, err := AssembleReplicatedJob(spec, true, spec.Spec.Pods, spec.Spec.Pods, "", m.SoleTenancy) if err != nil { - logger.Errorf("There was an error getting containers for %s: %s\n", m.Name(), err) - return rjs, err + return js, err } - job.Template.Spec.Template.Spec.Containers = containers - rjs = append(rjs, *job) - return rjs, nil + js = []*jobset.ReplicatedJob{rj} + return js, nil } diff --git a/pkg/metrics/base.go b/pkg/metrics/base.go new file mode 100644 index 0000000..5e1ff4b --- /dev/null +++ b/pkg/metrics/base.go @@ -0,0 +1,201 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package metrics + +import ( + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/addons" + "github.com/converged-computing/metrics-operator/pkg/specs" + "k8s.io/apimachinery/pkg/util/intstr" + jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" +) + +// BaseMetric provides shared attributes across Metric types +type BaseMetric struct { + Identifier string + Summary string + Container string + Workdir string + + // A custom container can be used to replace the application + // (typically advanced users only) + CustomContainer string + ResourceSpec *api.ContainerResources + AttributeSpec *api.ContainerSpec + + // If we ask for sole tenancy, we assign 1 pod / hostname + SoleTenancy bool + + // A metric can have one or more addons + Addons map[string]*addons.Addon +} + +// RegisterAddon adds an addon to the set, assuming it's already validated +func (m *BaseMetric) RegisterAddon(addon *addons.Addon) { + a := (*addon) + if m.Addons == nil { + m.Addons = map[string]*addons.Addon{} + } + logger.Infof("🟧️ Registering addon %s", a) + m.Addons[a.Name()] = addon +} + +// Name returns the metric name +func (m BaseMetric) Name() string { + return m.Identifier +} + +// Set a custom container +func (m *BaseMetric) 
SetContainer(container string) { + m.Container = container +} + +// Description returns the metric description +func (m BaseMetric) Description() string { + return m.Summary +} + +// Container +func (m *BaseMetric) Image() string { + return m.Container +} + +// Return container resources for the metric container +func (m BaseMetric) Resources() *api.ContainerResources { + return m.ResourceSpec +} +func (m BaseMetric) Attributes() *api.ContainerSpec { + return m.AttributeSpec +} + +// Validation +func (m BaseMetric) Validate(set *api.MetricSet) bool { + if m.Identifier == "" { + logger.Errorf("Metric %s is missing an identifier.\n", m) + return false + } + return true +} + +func (m BaseMetric) ListOptions() map[string][]intstr.IntOrString { + return map[string][]intstr.IntOrString{} +} + +// Jobs required for success condition (n is the netmark run) +func (m BaseMetric) SuccessJobs() []string { + return []string{} +} + +func (m BaseMetric) HasSoleTenancy() bool { + return m.SoleTenancy +} + +// Default replicated jobs will generate for N pods, with no shared process namespace (e.g., storage) +func (m *BaseMetric) ReplicatedJobs(spec *api.MetricSet) ([]*jobset.ReplicatedJob, error) { + + js := []*jobset.ReplicatedJob{} + + // An empty jobname will default to "m" the ReplicatedJobName provided by the operator + rj, err := AssembleReplicatedJob(spec, false, spec.Spec.Pods, spec.Spec.Pods, "", m.SoleTenancy) + if err != nil { + return js, err + } + js = []*jobset.ReplicatedJob{rj} + return js, nil +} + +// SetDefaultOptions that are shared (possibly) +func (m *BaseMetric) SetDefaultOptions(metric *api.Metric) { + st, ok := metric.Options["soleTenancy"] + if ok && (st.StrVal == "false" || st.StrVal == "no") { + m.SoleTenancy = false + } + if ok && (st.StrVal == "true" || st.StrVal == "yes") { + m.SoleTenancy = true + } +} + +// Add registered addons to replicated jobs +// Container specs returned are assumed to be config maps that need to be written +func (m BaseMetric)
AddAddons( + spec *api.MetricSet, + rjs []*jobset.ReplicatedJob, + + // These container specs include all replicated jobs + containerSpecs []*specs.ContainerSpec, +) ([]*specs.ContainerSpec, error) { + + // VolumeMounts can be generated from container specs + // For each addon, do custom logic depending on the type + // These are the main set of volumes, containers we are going to add + // Organize volumes by unique name + volumes := []specs.VolumeSpec{} + + // These are addon container specs + addonContainers := []specs.ContainerSpec{} + + // These are container specs that need to be written to configmaps + cms := []*specs.ContainerSpec{} + + logger.Infof("🟧️ Addons to include %s\n", m.Addons) + for _, addon := range m.Addons { + a := (*addon) + + volumes = append(volumes, a.AssembleVolumes()...) + + // Assemble containers that addons provide, also as specs + assembleContainers := a.AssembleContainers() + for _, assembleContainer := range assembleContainers { + + // Any container specs that need to be created later as config maps are kept in cms + if assembleContainer.NeedsWrite { + cms = append(cms, &assembleContainer) + } + addonContainers = append(addonContainers, assembleContainer) + } + + // Allow the addons to customize the container entrypoints, specific to the job name + // It's important that this set does not include other addon container specs + a.CustomizeEntrypoints(containerSpecs, rjs) + } + + // There is a bug here showing lots of nil but I don't know why + logger.Infof("🟧️ Volumes that are going to be added %s\n", volumes) + + // Add containers to the replicated job (filtered based on matching names) + containers := addonContainers + for _, cs := range containerSpecs { + containers = append(containers, (*cs)) + } + + // Generate actual containers and volumes for each replicated job + for _, rj := range rjs { + + // We also include the addon volumes, which generally need mount points + rjContainers, err := getReplicatedJobContainers(spec, rj, 
containers, volumes) + if err != nil { + return cms, err + } + rj.Template.Spec.Template.Spec.Containers = rjContainers + + // And volumes! + // containerSpecs are used to generate our metric entrypoint volumes + // volumes indicate existing volumes + rj.Template.Spec.Template.Spec.Volumes = getReplicatedJobVolumes(spec, containerSpecs, volumes) + } + return cms, nil +} + +// Addons returns a list of addons, removing them from the key value lookup +func (m BaseMetric) GetAddons() []*addons.Addon { + addons := []*addons.Addon{} + for _, addon := range m.Addons { + addons = append(addons, addon) + } + return addons +} diff --git a/pkg/metrics/containers.go b/pkg/metrics/containers.go index dee1a41..d4ad070 100644 --- a/pkg/metrics/containers.go +++ b/pkg/metrics/containers.go @@ -8,11 +8,11 @@ SPDX-License-Identifier: MIT package metrics import ( - "fmt" - corev1 "k8s.io/api/core/v1" + jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" ) // Security context defaults @@ -21,66 +21,20 @@ var ( capPtrace = corev1.Capability("SYS_PTRACE") ) -// A ContainerSpec is used by a metric to define a container -type ContainerSpec struct { - Command []string - Image string - Name string - WorkingDir string - Resources *api.ContainerResources - Attributes *api.ContainerSpec -} - -// Named entrypoint script for a container -type EntrypointScript struct { - Name string - Path string - Script string -} - -// getContainers gets containers for a set of metrics -func getContainers( +// getReplicatedJobContainers gets containers for the replicated job +// also generating needed mounts, etc. 
+func getReplicatedJobContainers( set *api.MetricSet, - metrics []*Metric, - volumes map[string]api.Volume, + rj *jobset.ReplicatedJob, + containerSpecs []specs.ContainerSpec, + volumes []specs.VolumeSpec, ) ([]corev1.Container, error) { - containers := []ContainerSpec{} - - // Create one container per metric! - // Each needs to have the sys trace capability to see the application pids - for i, m := range metrics { - - metric := (*m) - script := fmt.Sprintf("/metrics_operator/entrypoint-%d.sh", i) - command := []string{"/bin/bash", script} - - newContainer := ContainerSpec{ - Command: command, - Image: metric.Image(), - WorkingDir: metric.WorkingDir(), - Name: metric.Name(), - Resources: metric.Resources(), - Attributes: metric.Attributes(), - } - containers = append(containers, newContainer) - } - return GetContainers(set, containers, volumes, false, false) -} - -// GetContainers based on one or more container specs -func GetContainers( - set *api.MetricSet, - specs []ContainerSpec, - volumes map[string]api.Volume, - allowPtrace bool, - allowAdmin bool, -) ([]corev1.Container, error) { + // We only generate containers from specs that match the replicated job name + containers := []corev1.Container{} - // Assume we can pull once for now, this could be changed to allow - // corev2.PullAlways + // Assume we can pull once for now, this could be changed to allow pull always pullPolicy := corev1.PullIfNotPresent - containers := []corev1.Container{} // Currently we share the same mounts across containers, makes life easier! 
mounts := getVolumeMounts(set, volumes) @@ -89,29 +43,34 @@ func GetContainers( hasPrivileged := false // Each needs to have the sys trace capability to see the application pids - for _, s := range specs { + for _, cs := range containerSpecs { - hasPrivileged = hasPrivileged || s.Attributes.SecurityContext.Privileged - - // Get resources for container - resources, err := getContainerResources(s.Resources) - logger.Info("🌀 Metric", "Container.Resources", resources) + // Skip containers not intended for the replicated job + if cs.JobName != "" && cs.JobName != rj.Name { + continue + } + hasPrivileged = hasPrivileged || cs.Attributes.SecurityContext.Privileged + resources, err := getContainerResources(cs.Resources) if err != nil { return containers, err } - // Create one container per metric! - // Name the container by the metric for now + // If a command is provided, use it first + command := []string{"/bin/bash", cs.EntrypointScript.Path} + if len(cs.Command) > 0 { + command = cs.Command + } + // Create the actual container from the spec newContainer := corev1.Container{ - Name: s.Name, - Image: s.Image, + Name: cs.Name, + Image: cs.Image, ImagePullPolicy: pullPolicy, VolumeMounts: mounts, Stdin: true, TTY: true, - Command: s.Command, + Command: command, SecurityContext: &corev1.SecurityContext{ - Privileged: &s.Attributes.SecurityContext.Privileged, + Privileged: &cs.Attributes.SecurityContext.Privileged, }, } @@ -119,21 +78,20 @@ func GetContainers( caps := []corev1.Capability{} // Should we allow sharing the process namespace? 
- if allowPtrace { + if cs.Attributes.SecurityContext.AllowPtrace { caps = append(caps, capPtrace) } - if allowAdmin { + if cs.Attributes.SecurityContext.AllowAdmin { caps = append(caps, capAdmin) } newContainer.SecurityContext.Capabilities = &corev1.Capabilities{Add: caps} // Only add the working directory if it's defined - if s.WorkingDir != "" { - newContainer.WorkingDir = s.WorkingDir + if cs.WorkingDir != "" { + newContainer.WorkingDir = cs.WorkingDir } - // Ports and environment - // TODO this should be added when needed + // Ports and environment (add when needed) ports := []corev1.ContainerPort{} envars := []corev1.EnvVar{} newContainer.Ports = ports @@ -141,45 +99,6 @@ func GetContainers( newContainer.Resources = resources containers = append(containers, newContainer) } - - // If our metric set has an application, add it last - // We currently accept resources for an application (but not metrics yet) - if set.HasApplication() { - - // Prepare container resources - resources, err := getContainerResources(&set.Spec.Application.Resources) - logger.Info("🌀 Application", "Container.Resources", resources) - if err != nil { - return containers, err - } - - // The application security context can have admin (but should not have the same process sharing) - securityContext := &corev1.SecurityContext{} - if allowAdmin { - securityContext.Capabilities = &corev1.Capabilities{ - Add: []corev1.Capability{capAdmin}, - } - securityContext.Privileged = &hasPrivileged - } - - // Minimally this is set.Spec.Application.Entrypoint executed in a bash script - // But for an application metric with a volume, there can be custom logic - command := []string{"/bin/bash", DefaultApplicationEntrypoint} - appContainer := corev1.Container{ - Name: "app", - Image: set.Spec.Application.Image, - ImagePullPolicy: pullPolicy, - VolumeMounts: mounts, - Stdin: true, - TTY: true, - Command: command, - SecurityContext: securityContext, - } - if set.Spec.Application.WorkingDir != "" { - 
appContainer.WorkingDir = set.Spec.Application.WorkingDir - } - containers = append(containers, appContainer) - } logger.Infof("🟪️ Adding %d containers\n", len(containers)) return containers, nil } diff --git a/pkg/metrics/io/fio.go b/pkg/metrics/io/fio.go index d14ff2d..37854b5 100644 --- a/pkg/metrics/io/fio.go +++ b/pkg/metrics/io/fio.go @@ -10,18 +10,25 @@ package io import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) // FIO means Flexible IO // https://docs.gitlab.com/ee/administration/operations/filesystem_benchmarking.html +const ( + fioIdentifier = "io-fio" + fioSummary = "Flexible IO Tester (FIO)" + fioContainer = "ghcr.io/converged-computing/metric-fio:latest" +) + type Fio struct { - jobs.StorageGeneric + metrics.StorageGeneric // Options testname string @@ -29,6 +36,14 @@ type Fio struct { iodepth int size string directory string + + // Or just define the entire command + command string + + // extra commands for pre, post, etc. 
+ pre string + post string + prefix string } func (m Fio) Url() string { @@ -40,6 +55,10 @@ func (m *Fio) SetOptions(metric *api.Metric) { m.ResourceSpec = &metric.Resources m.AttributeSpec = &metric.Attributes + m.Identifier = fioIdentifier + m.Summary = fioSummary + m.Container = fioContainer + // Set defaults for options m.testname = "test" m.blocksize = "4k" @@ -51,6 +70,10 @@ func (m *Fio) SetOptions(metric *api.Metric) { if ok { m.testname = v.StrVal } + v, ok = metric.Options["command"] + if ok { + m.command = v.StrVal + } v, ok = metric.Options["blocksize"] if ok { m.blocksize = v.StrVal @@ -67,62 +90,84 @@ func (m *Fio) SetOptions(metric *api.Metric) { if ok { m.iodepth = int(v.IntVal) } + v, ok = metric.Options["prefix"] + if ok { + m.prefix = v.StrVal + } + v, ok = metric.Options["pre"] + if ok { + m.pre = v.StrVal + } + v, ok = metric.Options["post"] + if ok { + m.post = v.StrVal + } } -// Generate the entrypoint for measuring the storage -func (m Fio) EntrypointScripts( +func (m Fio) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { - // Prepare metadata for set and separator - metadata := metrics.Metadata(spec, metric) - template := `#!/bin/bash + // Metadata to add to beginning of run + meta := metrics.Metadata(spec, metric) + // Assemble the command first. 
This way, the user can define the entire thing OR we can control it + command := "%s fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=%s --bs=%s --iodepth=%d --readwrite=randrw --rwmixread=75 --size=%s --filename=$filename --output-format=json" + command = fmt.Sprintf( + command, + m.prefix, + m.testname, + m.blocksize, + m.iodepth, + m.size, + ) + // Overwrite with user command + if m.command != "" { + command = m.command + } + + preBlock := `#!/bin/bash echo "%s" # Directory (and filename) for test assuming other storage mounts filename=%s/test-$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 32) # Run the pre-command here so it has access to the filename. %s -command="%s fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=%s --bs=%s --iodepth=%d --readwrite=randrw --rwmixread=75 --size=%s --filename=$filename --output-format=json" +command="%s" echo "FIO COMMAND START" echo $command echo "FIO COMMAND END" # FIO just has one command, we don't need to think about completions / etc! echo "%s" echo "%s" -$command +` + preBlock = fmt.Sprintf( + preBlock, + meta, + m.directory, + m.pre, + command, + metadata.CollectionStart, + metadata.Separator, + ) + + postBlock := ` echo "%s" # Run command here so it's after collection finish, but before removing the filename %s %s rm -rf $filename -%s +%s ` - script := fmt.Sprintf( - template, - metadata, - m.directory, - spec.Spec.Storage.Commands.Pre, - spec.Spec.Storage.Commands.Prefix, - m.testname, - m.blocksize, - m.iodepth, - m.size, - metrics.CollectionStart, - metrics.Separator, - metrics.CollectionEnd, - spec.Spec.Storage.Commands.Post, - spec.Spec.Storage.Commands.Prefix, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - // The entrypoint is the entrypoint for the container, while - // the command is expected to be what we are monitoring. Often - // they are the same thing. 
We return an empty Name so it's automatically - // assigned - return []metrics.EntrypointScript{ - {Script: script}, - } + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + postBlock = fmt.Sprintf( + postBlock, + metadata.CollectionEnd, + m.post, + m.prefix, + interactive, + ) + return m.StorageContainerSpec(preBlock, "$command", postBlock) } // Exported options and list options @@ -133,15 +178,17 @@ func (m Fio) Options() map[string]intstr.IntOrString { "iodepth": intstr.FromInt(m.iodepth), "size": intstr.FromString(m.size), "directory": intstr.FromString(m.directory), + "command": intstr.FromString(m.command), } } func init() { - storage := jobs.StorageGeneric{ - Identifier: "io-fio", - Summary: "Flexible IO Tester (FIO)", - Container: "ghcr.io/converged-computing/metric-fio:latest", + base := metrics.BaseMetric{ + Identifier: fioIdentifier, + Summary: fioSummary, + Container: fioContainer, } + storage := metrics.StorageGeneric{BaseMetric: base} fio := Fio{StorageGeneric: storage} metrics.Register(&fio) } diff --git a/pkg/metrics/io/ior.go b/pkg/metrics/io/ior.go index 3fd5621..3890f0e 100644 --- a/pkg/metrics/io/ior.go +++ b/pkg/metrics/io/ior.go @@ -10,22 +10,31 @@ package io import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" +) + +const ( + iorIdentifier = "io-ior" + iorSummary = "HPC IO Benchmark" + iorContainer = "ghcr.io/converged-computing/metric-ior:latest" ) // Ior means Flexible IO // https://docs.gitlab.com/ee/administration/operations/filesystem_benchmarking.html type Ior struct { - jobs.StorageGeneric + metrics.StorageGeneric // Options 
workdir string command string + pre string + post string } func (m Ior) Url() string { @@ -37,6 +46,10 @@ func (m *Ior) SetOptions(metric *api.Metric) { m.ResourceSpec = &metric.Resources m.AttributeSpec = &metric.Attributes + m.Identifier = iorIdentifier + m.Container = iorContainer + m.Summary = iorSummary + // Set defaults for options m.command = "ior -w -r -o testfile" m.workdir = "/opt/ior" @@ -53,44 +66,53 @@ func (m *Ior) SetOptions(metric *api.Metric) { if ok { m.workdir = workdir.StrVal } + v, ok := metric.Options["pre"] + if ok { + m.pre = v.StrVal + } + v, ok = metric.Options["post"] + if ok { + m.post = v.StrVal + } } -// Generate the entrypoint for measuring the storage -func (m Ior) EntrypointScripts( +func (m Ior) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { - // Prepare metadata for set and separator - metadata := metrics.Metadata(spec, metric) - template := `#!/bin/bash + // Metadata to add to beginning of run + meta := metrics.Metadata(spec, metric) + + preBlock := `#!/bin/bash echo "%s" # Directory (and filename) for test assuming other storage mounts cd %s echo "%s" echo "%s" -%s +` + + postBlock := ` echo "%s" %s %s -%s ` - script := fmt.Sprintf( - template, - metadata, + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = fmt.Sprintf( + preBlock, + meta, m.workdir, - metrics.CollectionStart, - metrics.Separator, - m.command, - metrics.CollectionEnd, - spec.Spec.Storage.Commands.Post, - spec.Spec.Storage.Commands.Prefix, - metrics.Interactive(spec.Spec.Logging.Interactive), + metadata.CollectionStart, + metadata.Separator, ) - return []metrics.EntrypointScript{ - {Script: script}, - } + postBlock = fmt.Sprintf( + postBlock, + metadata.CollectionEnd, + m.post, + interactive, + ) + return m.StorageContainerSpec(preBlock, m.command, postBlock) } // Exported options and list options @@ -102,11 +124,12 @@ func (m Ior) Options() 
map[string]intstr.IntOrString { } func init() { - storage := jobs.StorageGeneric{ - Identifier: "io-ior", - Summary: "HPC IO Benchmark", - Container: "ghcr.io/converged-computing/metric-ior:latest", + base := metrics.BaseMetric{ + Identifier: iorIdentifier, + Summary: iorSummary, + Container: iorContainer, } + storage := metrics.StorageGeneric{BaseMetric: base} Ior := Ior{StorageGeneric: storage} metrics.Register(&Ior) } diff --git a/pkg/metrics/io/sysstat.go b/pkg/metrics/io/sysstat.go index dee4f31..6c52241 100644 --- a/pkg/metrics/io/sysstat.go +++ b/pkg/metrics/io/sysstat.go @@ -11,21 +11,32 @@ import ( "fmt" "strconv" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" +) + +const ( + iostatIdentifier = "io-sysstat" + iostatSummary = "statistics for Linux tasks (processes) : I/O, CPU, memory, etc." 
+ iostatContainer = "ghcr.io/converged-computing/metric-sysstat:latest" ) // sysstat provides a tool "iostat" to assess a storage mount // https://github.com/sysstat/sysstat type IOStat struct { - jobs.StorageGeneric + metrics.StorageGeneric humanReadable bool rate int32 completions int32 + + // pre and post commands + pre string + post string } func (m IOStat) Url() string { @@ -34,6 +45,11 @@ func (m IOStat) Url() string { // Set custom options / attributes for the metric func (m *IOStat) SetOptions(metric *api.Metric) { + + m.Identifier = iostatIdentifier + m.Summary = iostatSummary + m.Container = iostatContainer + m.rate = 10 m.completions = 0 // infinite m.ResourceSpec = &metric.Resources @@ -46,6 +62,15 @@ func (m *IOStat) SetOptions(metric *api.Metric) { m.humanReadable = true } } + v, ok := metric.Options["pre"] + if ok { + m.pre = v.StrVal + } + v, ok = metric.Options["post"] + if ok { + m.post = v.StrVal + } + rate, ok := metric.Options["rate"] if ok { m.rate = rate.IntVal @@ -57,20 +82,20 @@ func (m *IOStat) SetOptions(metric *api.Metric) { } -// Generate the entrypoint for measuring the storage -func (m IOStat) EntrypointScripts( +func (m IOStat) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { - // Prepare metadata for set and separator - metadata := metrics.Metadata(spec, metric) + // Metadata to add to beginning of run + meta := metrics.Metadata(spec, metric) command := "iostat -dxm -o JSON" if m.humanReadable { command = "iostat -dxm" } - template := `#!/bin/bash -# Custom pre command + + preBlock := `#!/bin/bash +# Custom pre command logic %s i=0 echo "%s" @@ -89,32 +114,28 @@ while true sleep %d let i=i+1 done -# Custom post command after done, if we get here +` + + postBlock := ` %s %s ` - script := fmt.Sprintf( - template, - spec.Spec.Storage.Commands.Pre, - metadata, + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = fmt.Sprintf( +
preBlock, + m.pre, + meta, m.completions, - metrics.CollectionStart, - metrics.Separator, + metadata.CollectionStart, + metadata.Separator, command, - metrics.CollectionEnd, - spec.Spec.Storage.Commands.Post, + metadata.CollectionEnd, + metadata.CollectionEnd, m.rate, - spec.Spec.Storage.Commands.Post, - metrics.Interactive(spec.Spec.Logging.Interactive), ) - // The entrypoint is the entrypoint for the container, while - // the command is expected to be what we are monitoring. Often - // they are the same thing. We return an empty Name so it's automatically - // assigned - return []metrics.EntrypointScript{ - {Script: script}, - } + postBlock = fmt.Sprintf(postBlock, m.post, interactive) + return m.StorageContainerSpec(preBlock, "", postBlock) } // Exported options and list options @@ -127,11 +148,12 @@ func (m IOStat) Options() map[string]intstr.IntOrString { } func init() { - storage := jobs.StorageGeneric{ - Identifier: "io-sysstat", - Summary: "statistics for Linux tasks (processes) : I/O, CPU, memory, etc.", - Container: "ghcr.io/converged-computing/metric-sysstat:latest", + base := metrics.BaseMetric{ + Identifier: iostatIdentifier, + Summary: iostatSummary, + Container: iostatContainer, } + storage := metrics.StorageGeneric{BaseMetric: base} iostat := IOStat{StorageGeneric: storage} metrics.Register(&iostat) } diff --git a/pkg/metrics/jobset.go b/pkg/metrics/jobset.go index f7a0768..e558a01 100644 --- a/pkg/metrics/jobset.go +++ b/pkg/metrics/jobset.go @@ -10,11 +10,11 @@ package metrics // Each type of metric returns a replicated job that can be put into a common JobSet import ( - batchv1 "k8s.io/api/batch/v1" corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" ) @@ -30,44 +30,70 @@ var ( 
const podLabelAppName = "app.kubernetes.io/name" -// GetJobSet is called by the controller to return some JobSet based -// on the type: application, storage, or standalone +// GetJobSet is called by the controller to return a JobSet for the MetricSet func GetJobSet( spec *api.MetricSet, - sets *map[string]MetricSet, -) ([]*jobset.JobSet, error) { + set *MetricSet, +) (*jobset.JobSet, []*specs.ContainerSpec, error) { + containerSpecs := []*specs.ContainerSpec{} - // Assume we can eventually support >1 jobset - jobsets := []*jobset.JobSet{} + // TODO each metric needs to provide some listing of success jobs... + // Success Set we expect some subset of the replicated job names + successJobs := getSuccessJobs(set.Metrics()) - // Assume we have one jobset type - for _, set := range *sets { - // For a standalone, we expect one JobSet with 1+ replicatedJobs, and a custom - // Success Set we expect some subset of the replicated job names - successJobs := getSuccessJobs(set.Metrics()) + // A base JobSet can hold one or more replicated jobs + js := getBaseJobSet(spec, successJobs) - // A base JobSet can hold one or more replicated jobs - js := getBaseJobSet(spec, successJobs) + // Get one or more replicated jobs, some number from each metric + rjs := []jobset.ReplicatedJob{} - // Get one or more replicated jobs, depending on the type - rjs, err := set.ReplicatedJobs(spec) + // Get one replicated job per metric, and for each, extend with addons + for _, metric := range set.Metrics() { + + // The metric exposes its own replicated jobs + // Since these are custom functions, we add addons / containers / volumes consistently after + m := (*metric) + jobs, err := m.ReplicatedJobs(spec) + if err != nil { + return js, containerSpecs, err + } + + // Generate container specs for the metric, each is associated with a replicated job + // The containers are paired with entrypoints, and also with the replicated jobs + // We do this so we can match addons easily. 
The only reason we do this outside + // of the loop below is to allow shared logic. + cs := m.PrepareContainers(spec, &m) + + // Prepare container and volume specs (that are changeable) e.g., + // 1. Create VolumeSpec across metrics and addons that can predefine volumes + // 2. Create ContainerSpec across metrics that can predefine containers, entrypoints, volumes + // 3. Container specs (cms) returned are expected to be config maps that need to be written + cms, err := m.AddAddons(spec, jobs, cs) if err != nil { - return jobsets, err + return js, containerSpecs, err } - // Get those replicated Jobs. - js.Spec.ReplicatedJobs = rjs - jobsets = append(jobsets, js) + // Add the finalized container specs for the entire set of replicated jobs + // We need this at the end to hand back to generate config maps + containerSpecs = append(containerSpecs, cs...) + containerSpecs = append(containerSpecs, cms...) + + // Add the final set of jobs (bad decision for the pointer here, oops) + for _, job := range jobs { + rjs = append(rjs, (*job)) + } } - return jobsets, nil + + // Get those replicated Jobs. + js.Spec.ReplicatedJobs = rjs + return js, containerSpecs, nil } // Get list of strings that define successful for a jobset. 
// Since these are from replicatedJobs in metrics, we collect from there func getSuccessJobs(metrics []*Metric) []string { - // Success jobs are always the default replicatedJobName for storage and application - // Use a map akin to a set + // Each metric can define if its jobs are required for success successJobs := map[string]bool{} for _, m := range metrics { for _, sj := range (*m).SuccessJobs() { @@ -117,88 +143,6 @@ func getBaseJobSet(set *api.MetricSet, successSet []string) *jobset.JobSet { return &js } -// getReplicatedJob returns the base of the replicated job -func GetReplicatedJob( - set *api.MetricSet, - shareProcessNamespace bool, - pods int32, - completions int32, - jobname string, - soleTenancy bool, -) (*jobset.ReplicatedJob, error) { - - // Default replicated job name, if not set - if jobname == "" { - jobname = ReplicatedJobName - } - - // Pod labels from the MetricSet - podLabels := set.GetPodLabels() - - // Always indexed completion mode to have predictable hostnames - completionMode := batchv1.IndexedCompletion - - // We only expect one replicated job (for now) so give it a short name for DNS - job := jobset.ReplicatedJob{ - Name: jobname, - Template: batchv1.JobTemplateSpec{ - ObjectMeta: metav1.ObjectMeta{ - Name: set.Name, - Namespace: set.Namespace, - }, - }, - // This is the default, but let's be explicit - Replicas: 1, - } - - // This should default to true - setAsFDQN := !set.Spec.DontSetFQDN - - // Create the JobSpec for the job -> Template -> Spec - jobspec := batchv1.JobSpec{ - BackoffLimit: &backoffLimit, - Parallelism: &pods, - Completions: &completions, - CompletionMode: &completionMode, - ActiveDeadlineSeconds: &set.Spec.DeadlineSeconds, - - // Note there is parameter to limit runtime - Template: corev1.PodTemplateSpec{ - ObjectMeta: metav1.ObjectMeta{ - Name: set.Name, - Namespace: set.Namespace, - Labels: podLabels, - }, - Spec: corev1.PodSpec{ - // matches the service - Subdomain: set.Spec.ServiceName, - RestartPolicy: 
corev1.RestartPolicyOnFailure, - - // This is important to share the process namespace! - SetHostnameAsFQDN: &setAsFDQN, - ShareProcessNamespace: &shareProcessNamespace, - ServiceAccountName: set.Spec.Pod.ServiceAccountName, - NodeSelector: set.Spec.Pod.NodeSelector, - }, - }, - } - - // Do we want sole tenancy? - if soleTenancy { - jobspec.Template.Spec.Affinity = getAffinity(set) - } - - // Do we have a pull secret for the application image? - if set.Spec.Application.PullSecret != "" { - jobspec.Template.Spec.ImagePullSecrets = []corev1.LocalObjectReference{ - {Name: set.Spec.Application.PullSecret}, - } - } - // Tie the jobspec to the job - job.Template.Spec = jobspec - return &job, nil -} - // getAffinity returns to pod affinity to ensure 1 address / node func getAffinity(set *api.MetricSet) *corev1.Affinity { return &corev1.Affinity{ diff --git a/pkg/metrics/launcher.go b/pkg/metrics/launcher.go new file mode 100644 index 0000000..b13d5d7 --- /dev/null +++ b/pkg/metrics/launcher.go @@ -0,0 +1,291 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package metrics + +import ( + "fmt" + + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/metadata" + "github.com/converged-computing/metrics-operator/pkg/specs" + jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" +) + +// These are common templates for standalone apps. +// They define the interface of a Metric. + +// These are used for network and job names, etc. +var ( + defaultLauncherLetter = "l" + defaultWorkerLetter = "w" +) + +// LauncherWorker is a launcher + worker setup for apps. 
These need to +// be accessible by other packages (and not conflict with function names) +type LauncherWorker struct { + BaseMetric + ResourceSpec *api.ContainerResources + AttributeSpec *api.ContainerSpec + + // A metric can have one or more addons + Addons []*api.MetricAddon + + // Most launcher workers have a command + Command string + Prefix string + + // Scripts + WorkerScript string + LauncherScript string + LauncherLetter string + WorkerContainer string + LauncherContainer string + WorkerLetter string +} + +// Family returns a generic performance family +func (m LauncherWorker) Family() string { + return PerformanceFamily +} + +// Jobs required for success condition (n is the LauncherWorker run) +func (m *LauncherWorker) SuccessJobs() []string { + m.ensureDefaultNames() + return []string{m.LauncherLetter} +} + +// Set default options / attributes for the launcher metric +func (m *LauncherWorker) SetDefaultOptions(metric *api.Metric) { + m.ResourceSpec = &metric.Resources + m.AttributeSpec = &metric.Attributes + + command, ok := metric.Options["command"] + if ok { + m.Command = command.StrVal + } + workdir, ok := metric.Options["workdir"] + if ok { + m.Workdir = workdir.StrVal + } + prefix, ok := metric.Options["prefix"] + if ok { + m.Prefix = prefix.StrVal + } +} + +// Ensure the worker and launcher default names are set +func (m *LauncherWorker) ensureDefaultNames() { + // Ensure we set the default launcher letter, if not set + if m.LauncherLetter == "" { + m.LauncherLetter = defaultLauncherLetter + } + if m.WorkerLetter == "" { + m.WorkerLetter = defaultWorkerLetter + } + if m.LauncherScript == "" { + m.LauncherScript = "/metrics_operator/launcher.sh" + } + if m.WorkerScript == "" { + m.WorkerScript = "/metrics_operator/worker.sh" + } + if m.LauncherContainer == "" { + m.LauncherContainer = "launcher" + } + if m.WorkerContainer == "" { + m.WorkerContainer = "workers" + } +} + +func (m *LauncherWorker) PrepareContainers( + spec *api.MetricSet, + metric 
*Metric, +) []*specs.ContainerSpec { + + // Metadata to add to beginning of run + meta := Metadata(spec, metric) + hosts := m.GetHostlist(spec) + prefix := m.GetCommonPrefix(meta, m.Command, hosts) + + preBlock := ` +echo "%s" +` + + postBlock := ` +echo "%s" +%s +` + command := fmt.Sprintf("%s ./problem.sh", m.Prefix) + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = prefix + fmt.Sprintf(preBlock, metadata.Separator) + postBlock = fmt.Sprintf(postBlock, metadata.CollectionEnd, interactive) + + // Entrypoint for the launcher + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: m.LauncherScript, + Pre: preBlock, + Command: command, + Post: postBlock, + } + + // Entrypoint for the worker + workerEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } + + // Container spec for the launcher + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) + + // Return the script templates for each of launcher and worker + return []*specs.ContainerSpec{&launcherContainer, &workerContainer} +} + +// GetCommonPrefix returns a common prefix for the worker/launcher script, setting up hosts, etc. +func (m *LauncherWorker) GetCommonPrefix( + meta string, + command string, + hosts string, +) string { + + // Generate problem.sh with command only if we have one! + if command != "" { + command = fmt.Sprintf(`# Write the command file +cat <<EOF > ./problem.sh +#!/bin/bash +%s +EOF +chmod +x ./problem.sh`, command) + } + + prefixTemplate := `#!/bin/bash +# Start ssh daemon +/usr/sbin/sshd -D & +echo "%s" +# Write the hosts file +cat <<EOF > ./hostlist.txt +%s +EOF + +%s + +# Allow network to ready (this could be a variable) +echo "Sleeping for 10 seconds waiting for network..." 
+sleep 10 +echo "%s" +` + return fmt.Sprintf( + prefixTemplate, + meta, + hosts, + command, + metadata.CollectionStart, + ) +} + +// AddWorkers generates worker jobs, only if we have them +func (m *LauncherWorker) AddWorkers(spec *api.MetricSet) (*jobset.ReplicatedJob, error) { + + numWorkers := spec.Spec.Pods - 1 + workers, err := AssembleReplicatedJob(spec, false, numWorkers, numWorkers, m.WorkerLetter, m.SoleTenancy) + if err != nil { + return workers, err + } + return workers, nil +} + +func (m *LauncherWorker) GetLauncherContainerSpec( + entrypoint specs.EntrypointScript, +) specs.ContainerSpec { + spec := specs.ContainerSpec{ + JobName: m.LauncherLetter, + Image: m.Image(), + Name: m.LauncherContainer, + EntrypointScript: entrypoint, + Resources: m.ResourceSpec, + Attributes: m.AttributeSpec, + } + if m.Workdir != "" { + spec.WorkingDir = m.Workdir + } + return spec +} +func (m *LauncherWorker) GetWorkerContainerSpec( + entrypoint specs.EntrypointScript, +) specs.ContainerSpec { + + // Container spec for the launcher + spec := specs.ContainerSpec{ + JobName: m.WorkerLetter, + Image: m.Image(), + Name: m.WorkerContainer, + EntrypointScript: entrypoint, + Resources: m.ResourceSpec, + Attributes: m.AttributeSpec, + } + if m.Workdir != "" { + spec.WorkingDir = m.Workdir + } + return spec +} + +// Replicated Jobs are custom for a launcher worker +func (m *LauncherWorker) ReplicatedJobs(spec *api.MetricSet) ([]*jobset.ReplicatedJob, error) { + + js := []*jobset.ReplicatedJob{} + m.ensureDefaultNames() + + // Generate a replicated job for the launcher (LauncherWorker) and workers + launcher, err := AssembleReplicatedJob(spec, false, 1, 1, m.LauncherLetter, m.SoleTenancy) + if err != nil { + return js, err + } + + numWorkers := spec.Spec.Pods - 1 + var workers *jobset.ReplicatedJob + + // Generate the replicated job with just a launcher, or launcher and workers + if numWorkers > 0 { + workers, err = m.AddWorkers(spec) + if err != nil { + return js, err + } + js = 
[]*jobset.ReplicatedJob{launcher, workers} + } else { + js = []*jobset.ReplicatedJob{launcher} + } + return js, nil +} + +// Validate that we can run a network. At least one launcher and worker is required +func (m LauncherWorker) Validate(spec *api.MetricSet) bool { + isValid := spec.Spec.Pods >= 2 + if !isValid { + logger.Errorf("Pods for a Launcher Worker app must be >=2. This app is invalid.") + } + return isValid +} + +// Get common hostlist for launcher/worker app +func (m *LauncherWorker) GetHostlist(spec *api.MetricSet) string { + m.ensureDefaultNames() + + // The launcher has a different hostname, n for netmark + hosts := fmt.Sprintf("%s-%s-0-0.%s.%s.svc.cluster.local\n", + spec.Name, m.LauncherLetter, spec.Spec.ServiceName, spec.Namespace, + ) + // Add number of workers + for i := 0; i < int(spec.Spec.Pods-1); i++ { + hosts += fmt.Sprintf("%s-%s-0-%d.%s.%s.svc.cluster.local\n", + spec.Name, m.WorkerLetter, i, spec.Spec.ServiceName, spec.Namespace) + } + return hosts +} diff --git a/pkg/metrics/logs.go b/pkg/metrics/logs.go index 2dfa198..41d269b 100644 --- a/pkg/metrics/logs.go +++ b/pkg/metrics/logs.go @@ -12,83 +12,30 @@ import ( "fmt" "log" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/metadata" "github.com/converged-computing/metrics-operator/pkg/utils" "go.uber.org/zap" - "k8s.io/apimachinery/pkg/util/intstr" ) // Consistent logging identifiers that should be echoed to have newline after var ( - Separator = "METRICS OPERATOR TIMEPOINT" - CollectionStart = "METRICS OPERATOR COLLECTION START" - CollectionEnd = "METRICS OPERATOR COLLECTION END" - handle *zap.Logger - logger *zap.SugaredLogger + logger *zap.SugaredLogger ) -// Metric Export is a flattened structure with minimal required metadata for now -// It would be nice if we could just dump everything. 
-type MetricExport struct { - - // Global - Pods int32 `json:"pods"` - Completions int32 `json:"completions"` - - // Application - ApplicationImage string `json:"applicationImage,omitempty"` - ApplicationCommand string `json:"applicationCommand,omitempty"` - - // Storage - StorageVolumePath string `json:"storageVolumePath,omitempty"` - StorageVolumeHostPath string `json:"storageVolumeHostPath,omitempty"` - StorageVolumeSecretName string `json:"storageVolumeSecretName,omitempty"` - StorageVolumeClaimName string `json:"storageVolumeClaimName,omitempty"` - StorageVolumeConfigMapName string `json:"storageVolumeConfigMapName,omitempty"` - - // Metric - MetricName string `json:"metricName,omitempty"` - MetricDescription string `json:"metricDescription,omitempty"` - MetricType string `json:"metricType,omitempty"` - MetricOptions map[string]intstr.IntOrString `json:"metricOptions,omitempty"` - MetricListOptions map[string][]intstr.IntOrString `json:"metricListOptions,omitempty"` -} - - -// Interactive returns a sleep infinity if interactive is true -func Interactive(interactive bool) string { - if interactive { - return "sleep infinity" - } - return "" -} - // Default metadata (in JSON) to also put at the top of logs for parsing // I'd like to improve upon this manual approach, it's a bit messy. 
func Metadata(set *api.MetricSet, metric *Metric) string { m := (*metric) - export := MetricExport{ + export := metadata.MetricExport{ // Global - Pods: set.Spec.Pods, - Completions: set.Spec.Completions, - - // Application - ApplicationImage: set.Spec.Application.Image, - ApplicationCommand: set.Spec.Application.Command, - - // Storage - StorageVolumePath: set.Spec.Storage.Volume.Path, - StorageVolumeHostPath: set.Spec.Storage.Volume.HostPath, - StorageVolumeSecretName: set.Spec.Storage.Volume.SecretName, - StorageVolumeClaimName: set.Spec.Storage.Volume.ClaimName, - StorageVolumeConfigMapName: set.Spec.Storage.Volume.ConfigMapName, + Pods: set.Spec.Pods, // Metric MetricName: m.Name(), MetricDescription: m.Description(), - MetricType: m.Type(), MetricOptions: m.Options(), MetricListOptions: m.ListOptions(), } diff --git a/pkg/metrics/metrics.go b/pkg/metrics/metrics.go index b49fbf3..b85ebf3 100644 --- a/pkg/metrics/metrics.go +++ b/pkg/metrics/metrics.go @@ -10,67 +10,92 @@ package metrics import ( "fmt" "log" + "reflect" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + addons "github.com/converged-computing/metrics-operator/pkg/addons" + "github.com/converged-computing/metrics-operator/pkg/specs" "k8s.io/apimachinery/pkg/util/intstr" jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" ) var ( - Registry = make(map[string]Metric) + Registry = map[string]Metric{} ) -// A general metric produces a JobSet with one or more replicated Jobs +// A general metric is a container added to a JobSet type Metric interface { + + // Metadata Name() string Description() string + Family() string Url() string + // Container attributes + Image() string + SetContainer(string) + + // Options and exportable attributes SetOptions(*api.Metric) - Validate(*api.MetricSet) bool - HasSoleTenancy() bool + Options() map[string]intstr.IntOrString + ListOptions() map[string][]intstr.IntOrString - // 
Attributes to expose for containers - WorkingDir() string - Image() string - Family() string + // Validation and append addons + Validate(*api.MetricSet) bool + RegisterAddon(*addons.Addon) + AddAddons(*api.MetricSet, []*jobset.ReplicatedJob, []*specs.ContainerSpec) ([]*specs.ContainerSpec, error) + GetAddons() []*addons.Addon - // One or more replicated jobs to populate a JobSet - ReplicatedJobs(*api.MetricSet) ([]jobset.ReplicatedJob, error) + // Attributes for JobSet, etc. + HasSoleTenancy() bool + ReplicatedJobs(*api.MetricSet) ([]*jobset.ReplicatedJob, error) SuccessJobs() []string - GetVolumes() map[string]api.Volume Resources() *api.ContainerResources Attributes() *api.ContainerSpec - // Metric type to know how to add to MetricSet - Type() string - - // Exportable attributes - Options() map[string]intstr.IntOrString - ListOptions() map[string][]intstr.IntOrString - - // EntrypointScripts are required to generate ConfigMaps - EntrypointScripts(*api.MetricSet, *Metric) []EntrypointScript + // Prepare Containers. These are used to generate configmaps, + // and populate the respective replicated jobs with containers! + PrepareContainers(*api.MetricSet, *Metric) []*specs.ContainerSpec } -// GetMetric returns the Component specified by name from `Registry`. +// GetMetric returns a metric, if it is known to the metrics operator +// We also confirm that the addon exists, validate, and instantiate it. 
func GetMetric(metric *api.Metric, set *api.MetricSet) (Metric, error) { + if _, ok := Registry[metric.Name]; ok { - m := Registry[metric.Name] - // Ensure the type is one acceptable - if !(m.Type() == ApplicationMetric || m.Type() == StorageMetric || m.Type() == StandaloneMetric) { - return nil, fmt.Errorf("%s is not a valid type", metric.Name) + // Start with the empty template, and create a copy + // This is important so we don't preserve state to the actual interface + template := Registry[metric.Name] + templateType := reflect.ValueOf(template) + if templateType.Kind() == reflect.Ptr { + templateType = reflect.Indirect(templateType) + } + m := reflect.New(templateType.Type()).Interface().(Metric) // Set global and custom options on the registry metric from the CRD m.SetOptions(metric) + // If the metric has a custom container, set here + if metric.Image != "" { + m.SetContainer(metric.Image) + } + + // Register addons, meaning adding the spec but not instantiating yet (or should we?) + for _, a := range metric.Addons { + + addon, err := addons.GetAddon(&a) + if err != nil { + return nil, fmt.Errorf("Addon %s for metric %s did not validate", a.Name, metric.Name) + } + m.RegisterAddon(&addon) + } + // After options are set, final validation if !m.Validate(set) { return nil, fmt.Errorf("%s did not validate", metric.Name) } - return m, nil } return nil, fmt.Errorf("%s is not a registered Metric type", metric.Name) @@ -80,7 +105,7 @@ func GetMetric(metric *api.Metric, set *api.MetricSet) (Metric, error) { func Register(m Metric) { name := m.Name() if _, ok := Registry[name]; ok { - log.Fatalf("Metric: %s has already been added to the registry", name) + log.Fatalf("Metric: %s has already been added to the registry\n", m) } Registry[name] = m } diff --git a/pkg/metrics/metricset.go b/pkg/metrics/metricset.go deleted file mode 100644 index 7912fd1..0000000 --- a/pkg/metrics/metricset.go +++ /dev/null @@ -1,165 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National 
Security, LLC - (c.f. AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package metrics - -import ( - "fmt" - "log" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" -) - -var ( - RegistrySet = make(map[string]MetricSet) -) - -const ( - // Metric Design Types - ApplicationMetric = "application" - StorageMetric = "storage" - StandaloneMetric = "standalone" - - // Metric Family Types (these likely can be changed) - StorageFamily = "storage" - MachineLearningFamily = "machine-learning" - NetworkFamily = "network" - SimulationFamily = "simulation" - SolverFamily = "solver" - - // Generic (more than one type, CPU/io, etc) - ProxyAppFamily = "proxyapp" - PerformanceFamily = "performance" -) - -// A MetricSet interface holds one or more Metrics -// and exposes the JobSet -type MetricSet interface { - - // Metric Set Type (string) - Type() string - Add(m *Metric) - Exists(m *Metric) bool - Metrics() []*Metric - EntrypointScripts(*api.MetricSet) []EntrypointScript - ReplicatedJobs(*api.MetricSet) ([]jobset.ReplicatedJob, error) -} - -// get an application default entrypoint, if not determined by metric -// NOTE: if the default is not used, we currently just support one metric -// that requires a volume or custom logic. This could be changed -// but my brain is too goobley right now. 
-func getApplicationDefaultEntrypoint(set *api.MetricSet) string { - template := `#!/bin/bash - exec %s -` - return fmt.Sprintf(template, set.Spec.Application.Entrypoint) -} - -// ConsolidateEntrypointScripts from a metric set into one list -func consolidateEntrypointScripts(metrics []*Metric, set *api.MetricSet) []EntrypointScript { - scripts := []EntrypointScript{} - seenApplicationEntry := false - for _, metric := range metrics { - for _, script := range (*metric).EntrypointScripts(set, metric) { - if script.Path == DefaultApplicationEntrypoint { - seenApplicationEntry = true - } - scripts = append(scripts, script) - } - } - - // If we have an application and we haven't seen the application-0.sh, add it - if set.HasApplication() && !seenApplicationEntry { - script := getApplicationDefaultEntrypoint(set) - scripts = append(scripts, EntrypointScript{ - Script: script, - Path: DefaultApplicationEntrypoint, - Name: DefaultApplicationName, - }) - } - return scripts -} - -// BaseMetricSet -type BaseMetricSet struct { - name string - metrics []*Metric - metricNames map[string]bool -} - -func (m BaseMetricSet) Metrics() []*Metric { - return m.metrics -} -func (m BaseMetricSet) Type() string { - return m.name -} -func (m BaseMetricSet) Exists(metric *Metric) bool { - _, ok := m.metricNames[(*metric).Name()] - return ok -} - -// Determine if any metrics in the set need sole tenancy -// This is defined on the level of the jobset for now -func (m BaseMetricSet) HasSoleTenancy() bool { - for _, m := range m.metrics { - if (*m).HasSoleTenancy() { - return true - } - } - return false -} - -func (m *BaseMetricSet) Add(metric *Metric) { - if !m.Exists(metric) { - m.metrics = append(m.metrics, metric) - m.metricNames[(*metric).Name()] = true - } -} -func (m *BaseMetricSet) EntrypointScripts(set *api.MetricSet) []EntrypointScript { - return consolidateEntrypointScripts(m.metrics, set) -} - -// Types of Metrics: Storage, Application, and Standalone - -// StorageMetricSet defines a 
MetricSet to measure storage interfaces -type StorageMetricSet struct { - BaseMetricSet -} - -// ApplicationMetricSet defines a MetricSet to measure application performance -type ApplicationMetricSet struct { - BaseMetricSet -} -type StandaloneMetricSet struct { - BaseMetricSet -} - -// Register a new Metric type, adding it to the Registry -func RegisterSet(m MetricSet) { - name := m.Type() - if _, ok := RegistrySet[name]; ok { - log.Fatalf("MetricSet: %s has already been added to the registry", name) - } - RegistrySet[name] = m -} - -// GetMetric returns the Component specified by name from `Registry`. -func GetMetricSet(name string) (MetricSet, error) { - if _, ok := RegistrySet[name]; ok { - m := RegistrySet[name] - return m, nil - } - return nil, fmt.Errorf("%s is not a registered MetricSet type", name) -} - -func init() { - RegisterSet(&StorageMetricSet{BaseMetricSet{name: StorageMetric, metricNames: map[string]bool{}}}) - RegisterSet(&ApplicationMetricSet{BaseMetricSet{name: ApplicationMetric, metricNames: map[string]bool{}}}) - RegisterSet(&StandaloneMetricSet{BaseMetricSet{name: StandaloneMetric, metricNames: map[string]bool{}}}) -} diff --git a/pkg/metrics/network/chatterbug.go b/pkg/metrics/network/chatterbug.go index 173a799..b96d0af 100644 --- a/pkg/metrics/network/chatterbug.go +++ b/pkg/metrics/network/chatterbug.go @@ -11,16 +11,23 @@ import ( "fmt" "path" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - jobs "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) // ghcr.io/converged-computing/metric-osu-benchmark:latest // https://mvapich.cse.ohio-state.edu/benchmarks/ +const ( + cbIdentifier = "network-chatterbug" + 
cbSummary = "A suite of communication proxies for HPC applications" + cbContainer = "ghcr.io/converged-computing/metric-chatterbug:latest" +) + var ( // Directory (app) name and executable in /root/chatterbug @@ -37,7 +44,7 @@ var ( ) type Chatterbug struct { - jobs.LauncherWorker + metrics.LauncherWorker // Custom options command string @@ -63,6 +70,10 @@ func (m *Chatterbug) hasCommand(command string) bool { func (m *Chatterbug) SetOptions(metric *api.Metric) { m.lookup = map[string]bool{} + m.Identifier = cbIdentifier + m.Container = cbContainer + m.Summary = cbSummary + // Default command and args (for a demo) m.command = "stencil3d" m.args = "./stencil3d.x 2 2 2 10 10 10 4 1" @@ -118,16 +129,16 @@ func (n Chatterbug) Family() string { return metrics.NetworkFamily } -// Return lookup of entrypoint scripts -func (m Chatterbug) EntrypointScripts( +func (m Chatterbug) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) - hosts := m.GetHostlist(spec) + meta := metrics.Metadata(spec, metric) + // The launcher has a different hostname, n for netmark + hosts := m.GetHostlist(spec) prefixTemplate := `#!/bin/bash # Start ssh daemon /usr/sbin/sshd -D & @@ -167,43 +178,63 @@ done cat ./hostlist.txt # Show metadata for run echo "%s" -sleep infinity ` prefix := fmt.Sprintf( prefixTemplate, m.tasks, spec.Spec.Pods, hosts, - metadata, + meta, ) // Prepare command for chatterbug - commands := fmt.Sprintf("\nsleep 5\necho %s\n", metrics.CollectionStart) + commands := fmt.Sprintf("\nsleep 5\necho %s\n", metadata.CollectionStart) // Full path to, e.g., /root/chatterbug/stencil3d/stencil3d.x command := path.Join("/root/chatterbug", m.command, ChatterbugApps[m.command]) line := fmt.Sprintf("mpirun --hostfile ./hostlist.txt --allow-run-as-root %s %s %s", m.mpirun, command, m.args) - commands += fmt.Sprintf("echo %s\necho 
\"%s\"\n%s\n", metrics.Separator, line, line) - // Close the commands block - commands += fmt.Sprintf("echo %s\n", metrics.CollectionEnd) + commands += fmt.Sprintf("echo %s\necho \"%s\"\n", metadata.Separator, line) + + // The pre block has the prefix and commands, up to the echo of the command (line) + preBlock := fmt.Sprintf("%s\n%s", prefix, commands) + + // The post block has the collection end and interactive option + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + postBlock := fmt.Sprintf("echo %s\n%s\n", metadata.CollectionEnd, interactive) + + // The worker just has a preBlock with the prefix and the command is to sleep + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: m.LauncherScript, + Pre: preBlock, + Command: line, + Post: postBlock, + } + + // Entrypoint for the worker + workerEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } - // Template for the launcher with interactive mode, if desired - launcherTemplate := fmt.Sprintf("%s\n%s\n%s", prefix, commands, metrics.Interactive(spec.Spec.Logging.Interactive)) + // Container spec for the launcher + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) - // The worker just has sleep infinity added, and getting the ip address of the launcher - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) + // Return the script templates for each of launcher and worker + return []*specs.ContainerSpec{&launcherContainer, &workerContainer} } func init() { - launcher := jobs.LauncherWorker{ - Identifier: "network-chatterbug", - Summary: "A suite of communication proxies for HPC applications", - Container: "ghcr.io/converged-computing/metric-chatterbug:latest", - WorkerScript: 
"/metrics_operator/chatterbug-worker.sh", - LauncherScript: "/metrics_operator/chatterbug-launcher.sh", + base := metrics.BaseMetric{ + Identifier: cbIdentifier, + Summary: cbSummary, + Container: cbContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} bug := Chatterbug{LauncherWorker: launcher} metrics.Register(&bug) } diff --git a/pkg/metrics/network/netmark.go b/pkg/metrics/network/netmark.go index 86e4d97..898e18d 100644 --- a/pkg/metrics/network/netmark.go +++ b/pkg/metrics/network/netmark.go @@ -11,17 +11,23 @@ import ( "fmt" "strconv" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - jobs "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) // This library is currently private +const ( + netmarkIdentifier = "network-netmark" + netmarkSummary = "point to point networking tool" + netmarkContainer = "vanessa/netmark:latest" +) type Netmark struct { - jobs.LauncherWorker + metrics.LauncherWorker // Options tasks int32 @@ -57,6 +63,10 @@ func (m *Netmark) SetOptions(metric *api.Metric) { m.AttributeSpec = &metric.Attributes m.LauncherLetter = "n" + m.Identifier = netmarkIdentifier + m.Summary = netmarkSummary + m.Container = netmarkContainer + // One pod per hostname m.SoleTenancy = true @@ -74,6 +84,12 @@ func (m *Netmark) SetOptions(metric *api.Metric) { if ok { m.tasks = tasks.IntVal } + st, ok := metric.Options["soleTenancy"] + if ok { + if st.StrVal == "false" || st.StrVal == "no" { + m.SoleTenancy = false + } + } warmups, ok := metric.Options["warmups"] if ok { m.warmups = warmups.IntVal @@ -113,14 +129,15 @@ func (n Netmark) Options() map[string]intstr.IntOrString { } } -// Return lookup of entrypoint scripts -func 
(m Netmark) EntrypointScripts( +func (m Netmark) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) + + // The launcher has a different hostname, n for netmark hosts := m.GetHostlist(spec) // Add boolean flag to store the trial? @@ -154,16 +171,27 @@ echo "%s" ` prefix := fmt.Sprintf( prefixTemplate, - metadata, + meta, m.tasks, spec.Spec.Pods, hosts, - metrics.CollectionStart, + metadata.CollectionStart, ) - // Template for the launcher - template := ` -mpirun -f ./hostlist.txt -np $np /usr/local/bin/netmark.x -w %d -t %d -c %d -b %d %s + // Netmark main command + command := "mpirun -f ./hostlist.txt -np $np /usr/local/bin/netmark.x -w %d -t %d -c %d -b %d %s" + command = fmt.Sprintf( + command, + m.warmups, + m.trials, + m.sendReceiveCycles, + m.messageSize, + storeTrial, + ) + // The preBlock is also the prefix + preBlock := prefix + + postBlock := ` ls echo "NETMARK RTT.CSV START" cat RTT.csv @@ -171,30 +199,45 @@ echo "NETMARK RTT.CSV END" echo "%s" %s ` - launcherTemplate := prefix + fmt.Sprintf( - template, - m.warmups, - m.trials, - m.sendReceiveCycles, - m.messageSize, - storeTrial, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + postBlock = fmt.Sprintf( + postBlock, + metadata.CollectionEnd, + interactive, ) - // The worker just has sleep infinity added - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) + // The worker just has a preBlock with the prefix and the command is to sleep + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: m.LauncherScript, + Pre: preBlock, + Command: command, + Post: postBlock, + } + + // Entrypoint for the worker + 
workerEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } + + // Container spec for the launcher + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) + + // Return the script templates for each of launcher and worker + return []*specs.ContainerSpec{&launcherContainer, &workerContainer} } func init() { - launcher := jobs.LauncherWorker{ - Identifier: "network-netmark", - Summary: "point to point networking tool", - Container: "vanessa/netmark:latest", - WorkerScript: "/metrics_operator/netmark-worker.sh", - LauncherScript: "/metrics_operator/netmark-launcher.sh", + base := metrics.BaseMetric{ + Identifier: netmarkIdentifier, + Summary: netmarkSummary, + Container: netmarkContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} netmark := Netmark{LauncherWorker: launcher} metrics.Register(&netmark) } diff --git a/pkg/metrics/network/osu-benchmark.go b/pkg/metrics/network/osu-benchmark.go index 7854b3c..2f8e960 100644 --- a/pkg/metrics/network/osu-benchmark.go +++ b/pkg/metrics/network/osu-benchmark.go @@ -11,15 +11,21 @@ import ( "fmt" "path" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" "k8s.io/apimachinery/pkg/util/intstr" - jobs "github.com/converged-computing/metrics-operator/pkg/jobs" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" ) // ghcr.io/converged-computing/metric-osu-benchmark:latest // https://mvapich.cse.ohio-state.edu/benchmarks/ +const ( + OSUIdentifier = "network-osu-benchmark" + OSUSummary = "point to point MPI benchmarks" + OSUContainer = "ghcr.io/converged-computing/metric-osu-benchmark:latest" +) type 
BenchmarkConfig struct { Workdir string @@ -105,7 +111,7 @@ var ( ) type OSUBenchmark struct { - jobs.LauncherWorker + metrics.LauncherWorker // Custom options commands []string @@ -114,6 +120,7 @@ type OSUBenchmark struct { runAll bool flags string timed bool + sleep int32 } func (m OSUBenchmark) Url() string { @@ -134,8 +141,14 @@ func (m *OSUBenchmark) addCommand(command string) { // Set custom options / attributes for the metric func (m *OSUBenchmark) SetOptions(metric *api.Metric) { + + m.Identifier = OSUIdentifier + m.Container = OSUContainer + m.Summary = OSUSummary + m.lookup = map[string]bool{} m.commands = []string{} + m.sleep = 60 m.ResourceSpec = &metric.Resources m.AttributeSpec = &metric.Attributes @@ -159,7 +172,11 @@ func (m *OSUBenchmark) SetOptions(metric *api.Metric) { if ok { m.tasks = tasks.IntVal } - st, ok := metric.Options["sole-tenancy"] + sleep, ok := metric.Options["sleep"] + if ok { + m.sleep = sleep.IntVal + } + st, ok := metric.Options["soleTenancy"] if ok && st.StrVal == "false" || st.StrVal == "no" { m.SoleTenancy = false } @@ -229,18 +246,16 @@ func (n OSUBenchmark) Family() string { return metrics.NetworkFamily } -// Return lookup of entrypoint scripts -func (m OSUBenchmark) EntrypointScripts( +func (m OSUBenchmark) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) // The launcher has a different hostname, n for netmark hosts := m.GetHostlist(spec) - prefixTemplate := `#!/bin/bash # Start ssh daemon /usr/sbin/sshd -D & @@ -259,8 +274,9 @@ echo "Number of tasks (nproc on one node) is $tasks" echo "Number of tasks total (across $pods nodes) is $np" # Allow network to ready (we need the hostnames / ip addresses to be there) -echo "Sleeping for 60 seconds waiting for network..." 
-sleep 60 +sleeptime=%d +echo "Sleeping for ${sleeptime} seconds waiting for network..." +sleep ${sleeptime} # Write the hosts file. cat < ./hostnames.txt @@ -285,9 +301,10 @@ echo "%s" prefixTemplate, m.tasks, spec.Spec.Pods, + m.sleep, hosts, metrics.TemplateConvertHostnames, - metadata, + meta, ) // Do we want timed? @@ -295,11 +312,12 @@ echo "%s" if m.timed { mpirun = "time mpirun" } + // Prepare list of commands, e.g., // mpirun -f ./hostlist.txt -np 2 ./osu_acc_latency (mpich) // mpirun --hostfile ./hostfile.txt --allow-run-as-root -N 2 -np 2 ./osu_fop_latency (openmpi) // Sleep a little more to allow worker to write launcher hostname - commands := fmt.Sprintf("\nsleep 5\necho %s\n", metrics.CollectionStart) + commands := fmt.Sprintf("\nsleep 5\necho %s\n", metadata.CollectionStart) for _, executable := range m.commands { workDir := osuBenchmarkCommands[executable].Workdir @@ -319,28 +337,47 @@ echo "%s" } else { line = fmt.Sprintf("%s --hostfile %s --allow-run-as-root %s %s", mpirun, hostfile, flags, command) } - commands += fmt.Sprintf("echo %s\necho \"%s\"\n%s\n", metrics.Separator, line, line) + commands += fmt.Sprintf("echo %s\necho \"%s\"\n%s\n", metadata.Separator, line, line) + } + + // The pre block has the prefix and commands + preBlock := fmt.Sprintf("%s\n%s", prefix, commands) + + // The post block is just closing the colletion, and optionally interactive mode + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + postBlock := fmt.Sprintf("echo %s\n%s\n", metadata.CollectionEnd, interactive) + + // The worker just has a preBlock with the prefix and the command is to sleep + launcherEntrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(m.LauncherScript), + Path: m.LauncherScript, + Pre: preBlock, + Post: postBlock, } - // Close the commands block - commands += fmt.Sprintf("echo %s\n", metrics.CollectionEnd) + // Entrypoint for the worker + workerEntrypoint := specs.EntrypointScript{ + Name: 
specs.DeriveScriptKey(m.WorkerScript), + Path: m.WorkerScript, + Pre: prefix, + Command: "sleep infinity", + } - // Template for the launcher with interactive mode, if desired - launcherTemplate := fmt.Sprintf("%s\n%s\n%s", prefix, commands, metrics.Interactive(spec.Spec.Logging.Interactive)) + // Container spec for the launcher + launcherContainer := m.GetLauncherContainerSpec(launcherEntrypoint) + workerContainer := m.GetWorkerContainerSpec(workerEntrypoint) - // The worker just has sleep infinity added, and getting the ip address of the launcher - workerTemplate := prefix + "\nsleep infinity" - return m.FinalizeEntrypoints(launcherTemplate, workerTemplate) + // Return the script templates for each of launcher and worker + return []*specs.ContainerSpec{&launcherContainer, &workerContainer} } func init() { - launcher := jobs.LauncherWorker{ - Identifier: "network-osu-benchmark", - Summary: "point to point MPI benchmarks", - Container: "ghcr.io/converged-computing/metric-osu-benchmark:latest", - WorkerScript: "/metrics_operator/osu-worker.sh", - LauncherScript: "/metrics_operator/osu-launcher.sh", + base := metrics.BaseMetric{ + Identifier: OSUIdentifier, + Summary: OSUSummary, + Container: OSUContainer, } + launcher := metrics.LauncherWorker{BaseMetric: base} osu := OSUBenchmark{LauncherWorker: launcher} metrics.Register(&osu) } diff --git a/pkg/metrics/perf/hpctoolkit.go b/pkg/metrics/perf/hpctoolkit.go deleted file mode 100644 index 3229243..0000000 --- a/pkg/metrics/perf/hpctoolkit.go +++ /dev/null @@ -1,190 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National Security, LLC - (c.f. 
AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package perf - -import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - "github.com/converged-computing/metrics-operator/pkg/jobs" - metrics "github.com/converged-computing/metrics-operator/pkg/metrics" - "k8s.io/apimachinery/pkg/util/intstr" -) - -type HPCToolkit struct { - jobs.SingleApplication - events string - mount string -} - -func (m HPCToolkit) Url() string { - return "https://gitlab.com/hpctoolkit/hpctoolkit" -} - -// GetVolumes to provide an empty volume for the application to share -func (m HPCToolkit) GetVolumes() map[string]api.Volume { - return map[string]api.Volume{ - "hpctoolkit": { - Path: m.mount, - EmptyVol: true, - }, - } -} - -// Validate we have an executable provided, and args and optional -func (m *HPCToolkit) Validate(ms *api.MetricSet) bool { - if m.events == "" { - logger.Error("One or more events for hpcrun (events) are required (e.g., -e IO).") - return false - } - return true -} - -// Set custom options / attributes for the metric -func (m *HPCToolkit) SetOptions(metric *api.Metric) { - // Defaults for rate and completions - m.ResourceSpec = &metric.Resources - m.AttributeSpec = &metric.Attributes - m.mount = "/opt/share" - - // UseColor set to anything means to use it - mount, ok := metric.Options["mount"] - if ok { - m.mount = mount.StrVal - } - events, ok := metric.Options["events"] - if ok { - m.events = events.StrVal - } -} - -// Exported options and list options -func (m HPCToolkit) Options() map[string]intstr.IntOrString { - return map[string]intstr.IntOrString{ - "events": intstr.FromString(m.events), - "mount": intstr.FromString(m.mount), - } -} - -// Generate the replicated job for measuring the application -// TODO if the app is too fast we might miss it? -func (m HPCToolkit) EntrypointScripts( - spec *api.MetricSet, - metric *metrics.Metric, -) []metrics.EntrypointScript { - - // This is the metric container entrypoint. 
- // The sole purpose is just to provide the volume, meaning copying content there - template := `#!/bin/bash - -echo "Moving content from /opt/view to be in shared volume at %s" -view=$(ls /opt/views/._view/) -view="/opt/views/._view/${view}" - -# Give a little extra wait time -sleep 10 - -viewroot="%s" -mkdir -p $viewroot/view -# We have to move both of these paths, *sigh* -cp -R ${view}/* $viewroot/view -cp -R /opt/software $viewroot/ - -# Sleep forever, the application needs to run and end -echo "Sleeping forever so %s can be shared and use for hpctoolkit." -sleep infinity -` - script := fmt.Sprintf( - template, - m.mount, - m.mount, - m.mount, - ) - - // Custom logic for application entrypoint - metadata := metrics.Metadata(spec, metric) - custom := ` - -# Ensure hpcrun and software exists. This is rough, but should be OK with enough wait time -wget https://github.com/converged-computing/goshare/releases/download/2023-09-06/wait-fs -chmod +x ./wait-fs -mv ./wait-fs /usr/bin/goshare-wait-fs - -# Ensure spack view is on the path, wherever it is mounted -viewbase="%s" -software="${viewbase}/software" -viewbin="${viewbase}/view/bin" -export PATH=${viewbin}:$PATH - -# Wait for software directory, and give it time -goshare-wait-fs -p ${software} - -# Wait for copy to finish -sleep 10 - -# Copy mount software to /opt/software -cp -R %s/software /opt/software - -# Wait for hpcrun -goshare-wait-fs -p ${viewbin}/hpcrun - -# This will work with capability SYS_ADMIN added. -# It will only work with privileged set to true AT YOUR OWN RISK! -echo "-1" | tee /proc/sys/kernel/perf_event_paranoid - -# Run hpcrun. 
See options with hpcrun -L -events="%s" -echo "%s" -echo "%s" -echo "%s" - -# Commands to interact with output data -# hpcprof hpctoolkit-sleep-measurements -# hpcstruct hpctoolkit-sleep-measurements -# hpcviewer ./hpctoolkit-lmp-database -` - - custom = fmt.Sprintf( - custom, - m.mount, - m.mount, - m.events, - metadata, - metrics.CollectionStart, - metrics.Separator, - ) - - // And the suffix (post run) - suffix := ` -echo "%s" -%s -` - suffix = fmt.Sprintf( - suffix, - metrics.CollectionEnd, - metrics.Interactive(spec.Spec.Logging.Interactive), - ) - - // NOTE: for this container the metrics entrypoint just copies and then - // waits, and the custom application entrypoint runs the wrapped application - // command. - return []metrics.EntrypointScript{ - {Script: script}, - m.ApplicationEntrypoint(spec, custom, "hpcrun $events", suffix), - } -} - -func init() { - app := jobs.SingleApplication{ - Identifier: "perf-hpctoolkit", - Summary: "performance tools for measurement and analysis", - Container: "ghcr.io/converged-computing/metric-hpctoolkit-view:latest", - } - HPCToolkit := HPCToolkit{SingleApplication: app} - metrics.Register(&HPCToolkit) -} diff --git a/pkg/metrics/perf/sysstat.go b/pkg/metrics/perf/sysstat.go index 81ba204..c6d821e 100644 --- a/pkg/metrics/perf/sysstat.go +++ b/pkg/metrics/perf/sysstat.go @@ -11,17 +11,24 @@ import ( "fmt" "strconv" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - "github.com/converged-computing/metrics-operator/pkg/jobs" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/metadata" metrics "github.com/converged-computing/metrics-operator/pkg/metrics" + "github.com/converged-computing/metrics-operator/pkg/specs" "k8s.io/apimachinery/pkg/util/intstr" ) +const ( + pidstatIdentifier = "perf-sysstat" + pidstatSummary = "statistics for Linux tasks (processes) : I/O, CPU, memory, etc." 
+ pidstatContainer = "ghcr.io/converged-computing/metric-sysstat:latest" +) + // sysstat provides a tool "pidstat" that can monitor a PID (along with others) // https://github.com/sysstat/sysstat type PidStat struct { - jobs.SingleApplication + metrics.SingleApplication // Custom Options useColor bool @@ -29,6 +36,7 @@ type PidStat struct { useThreads bool rate int32 completions int32 + command string commands map[string]intstr.IntOrString } @@ -38,6 +46,11 @@ func (m PidStat) Url() string { // Set custom options / attributes for the metric func (m *PidStat) SetOptions(metric *api.Metric) { + + m.Identifier = pidstatIdentifier + m.Summary = pidstatSummary + m.Container = pidstatContainer + // Defaults for rate and completions m.rate = 10 m.completions = 0 // infinite @@ -74,6 +87,10 @@ func (m *PidStat) SetOptions(metric *api.Metric) { if ok { m.commands = commands } + command, ok := metric.Options["command"] + if ok { + m.command = command.StrVal + } } @@ -104,7 +121,7 @@ func (m PidStat) prepareIndexedCommand(spec *api.MetricSet) string { if len(m.commands) == 0 { // This is a global command for the entire application - command = fmt.Sprintf("command=\"%s\"\n", spec.Spec.Application.Command) + command = fmt.Sprintf("command=\"%s\"\n", m.command) } else { @@ -136,15 +153,13 @@ func (m PidStat) prepareIndexedCommand(spec *api.MetricSet) string { return command } -// Generate the replicated job for measuring the application -// TODO if the app is too fast we might miss it? 
-func (m PidStat) EntrypointScripts( +func (m PidStat) PrepareContainers( spec *api.MetricSet, metric *metrics.Metric, -) []metrics.EntrypointScript { +) []*specs.ContainerSpec { // Metadata to add to beginning of run - metadata := metrics.Metadata(spec, metric) + meta := metrics.Metadata(spec, metric) useColor := "" if !m.useColor { @@ -160,19 +175,19 @@ func (m PidStat) EntrypointScripts( if m.useThreads { useThreads = " -t " } - // Prepare custom logic to determine command + command := m.prepareIndexedCommand(spec) - template := `#!/bin/bash + preBlock := `#!/bin/bash echo "%s" # Download the wait binary wget https://github.com/converged-computing/goshare/releases/download/2023-07-27/wait > /dev/null chmod +x ./wait mv ./wait /usr/bin/goshare-wait - + # Do we want to use threads? threads="%s" - + # This is logic to determine the command, it will set $command # We do this because command to watch can vary between worker pods %s @@ -181,10 +196,10 @@ echo "$command" echo "PIDSTAT COMMAND END" echo "Waiting for application PID..." 
pid=$(goshare-wait -c "$command" -q) - + # Set color or not %s - + # See https://kellyjonbrazil.github.io/jc/docs/parsers/pidstat # for how we get lovely json i=0 @@ -192,16 +207,16 @@ completions=%d echo "%s" while true do - echo "%s" + echo "%s" %s - echo "CPU STATISTICS TASK" - pidstat -p ${pid} -u -h $threads -T TASK | jc --pidstat - echo "CPU STATISTICS CHILD" - pidstat -p ${pid} -u -h $threads -T CHILD | jc --pidstat + echo "CPU STATISTICS TASK" + pidstat -p ${pid} -u -h $threads -T TASK | jc --pidstat + echo "CPU STATISTICS CHILD" + pidstat -p ${pid} -u -h $threads -T CHILD | jc --pidstat echo "IO STATISTICS" - pidstat -p ${pid} -d -h $threads -T ALL | jc --pidstat + pidstat -p ${pid} -d -h $threads -T ALL | jc --pidstat echo "POLICY" - pidstat -p ${pid} -R -h $threads -T ALL | jc --pidstat + pidstat -p ${pid} -R -h $threads -T ALL | jc --pidstat echo "PAGEFAULTS TASK" pidstat -p ${pid} -r -h $threads -T TASK | jc --pidstat echo "PAGEFAULTS CHILD" @@ -218,51 +233,46 @@ while true pidstat -p ${pid} -w -h $threads -T ALL | jc --pidstat # Check if still running ps -p ${pid} > /dev/null - retval=$? - if [[ $retval -ne 0 ]]; then - echo "%s" - exit 0 - fi - if [[ $completions -ne 0 ]] && [[ $i -eq $completions ]]; then - echo "%s" - exit 0 - fi - sleep %d - let i=i+1 + retval=$? 
+ if [[ $retval -ne 0 ]]; then + echo "%s" + exit 0 + fi + if [[ $completions -ne 0 ]] && [[ $i -eq $completions ]]; then + echo "%s" + exit 0 + fi + sleep %d + let i=i+1 done -%s ` - script := fmt.Sprintf( - template, - metadata, + interactive := metadata.Interactive(spec.Spec.Logging.Interactive) + preBlock = fmt.Sprintf( + preBlock, + meta, useThreads, command, useColor, m.completions, - metrics.CollectionStart, - metrics.Separator, + metadata.CollectionStart, + metadata.Separator, showPIDS, - metrics.CollectionEnd, - metrics.CollectionEnd, + metadata.CollectionEnd, + metadata.CollectionEnd, m.rate, - metrics.Interactive(spec.Spec.Logging.Interactive), ) - - // NOTE: the entrypoint is the entrypoint for the container, while - // the command is expected to be what we are monitoring. Often - // they are the same thing. - return []metrics.EntrypointScript{ - {Script: script}, - } + postBlock := fmt.Sprintf("\n%s\n", interactive) + return m.ApplicationContainerSpec(preBlock, command, postBlock) } func init() { - app := jobs.SingleApplication{ - Identifier: "perf-sysstat", - Summary: "statistics for Linux tasks (processes) : I/O, CPU, memory, etc.", - Container: "ghcr.io/converged-computing/metric-sysstat:latest", + base := metrics.BaseMetric{ + Identifier: pidstatIdentifier, + Summary: pidstatSummary, + Container: pidstatContainer, } + app := metrics.SingleApplication{BaseMetric: base} pidstat := PidStat{SingleApplication: app} metrics.Register(&pidstat) } diff --git a/pkg/metrics/resources.go b/pkg/metrics/resources.go index bd2df96..4c059fb 100644 --- a/pkg/metrics/resources.go +++ b/pkg/metrics/resources.go @@ -10,7 +10,7 @@ package metrics import ( "fmt" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" corev1 "k8s.io/api/core/v1" "k8s.io/apimachinery/pkg/api/resource" diff --git a/pkg/metrics/set.go b/pkg/metrics/set.go new file mode 100644 index 0000000..506487e --- 
/dev/null +++ b/pkg/metrics/set.go @@ -0,0 +1,146 @@ +/* +Copyright 2023 Lawrence Livermore National Security, LLC + (c.f. AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package metrics + +import ( + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + batchv1 "k8s.io/api/batch/v1" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" +) + +var ( + RegistrySet = make(map[string]MetricSet) +) + +const ( + + // Metric Family Types (these likely can be changed) + StorageFamily = "storage" + MachineLearningFamily = "machine-learning" + NetworkFamily = "network" + SimulationFamily = "simulation" + SolverFamily = "solver" + + // Generic (more than one type, CPU/io, etc) + ProxyAppFamily = "proxyapp" + PerformanceFamily = "performance" +) + +// A MetricSet includes one or more metrics that are assembled into a JobSet +type MetricSet struct { + metrics []*Metric + metricNames map[string]bool +} + +func (m MetricSet) Metrics() []*Metric { + return m.metrics +} +func (m MetricSet) Exists(metric *Metric) bool { + _, ok := m.metricNames[(*metric).Name()] + return ok +} + +// Determine if any metrics in the set need sole tenancy +// This is defined on the level of the jobset for now +func (m MetricSet) HasSoleTenancy() bool { + for _, m := range m.metrics { + if (*m).HasSoleTenancy() { + return true + } + } + return false +} + +func (ms *MetricSet) Add(metric *Metric) { + if ms.metricNames == nil { + ms.metricNames = map[string]bool{} + } + m := (*metric) + if !ms.Exists(metric) { + ms.metrics = append(ms.metrics, metric) + ms.metricNames[m.Name()] = true + } +} + +// AssembleReplicatedJob is used by metrics to assemble a custom, replicated job. 
+func AssembleReplicatedJob( + set *api.MetricSet, + shareProcessNamespace bool, + pods int32, + completions int32, + jobname string, + soleTenancy bool, +) (*jobset.ReplicatedJob, error) { + + // Default replicated job name, if not set + if jobname == "" { + jobname = ReplicatedJobName + } + + // Pod labels from the MetricSet + podLabels := set.GetPodLabels() + + // Always indexed completion mode to have predictable hostnames + completionMode := batchv1.IndexedCompletion + + // We only expect one replicated job (for now) so give it a short name for DNS + job := jobset.ReplicatedJob{ + Name: jobname, + Template: batchv1.JobTemplateSpec{ + ObjectMeta: metav1.ObjectMeta{ + Name: set.Name, + Namespace: set.Namespace, + }, + }, + // This is the default, but let's be explicit + Replicas: 1, + } + + // This should default to true + setAsFDQN := !set.Spec.DontSetFQDN + + // Create the JobSpec for the job -> Template -> Spec + jobspec := batchv1.JobSpec{ + BackoffLimit: &backoffLimit, + Parallelism: &pods, + Completions: &completions, + CompletionMode: &completionMode, + ActiveDeadlineSeconds: &set.Spec.DeadlineSeconds, + + // Note there is parameter to limit runtime + Template: corev1.PodTemplateSpec{ + ObjectMeta: metav1.ObjectMeta{ + Name: set.Name, + Namespace: set.Namespace, + Labels: podLabels, + }, + Spec: corev1.PodSpec{ + // matches the service + Subdomain: set.Spec.ServiceName, + RestartPolicy: corev1.RestartPolicyOnFailure, + + // This is important to share the process namespace! + SetHostnameAsFQDN: &setAsFDQN, + ShareProcessNamespace: &shareProcessNamespace, + ServiceAccountName: set.Spec.Pod.ServiceAccountName, + NodeSelector: set.Spec.Pod.NodeSelector, + }, + }, + } + + // Do we want sole tenancy? 
+ if soleTenancy { + jobspec.Template.Spec.Affinity = getAffinity(set) + } + + // Tie the jobspec to the job + job.Template.Spec = jobspec + return &job, nil +} diff --git a/pkg/metrics/standalone.go b/pkg/metrics/standalone.go deleted file mode 100644 index fa86993..0000000 --- a/pkg/metrics/standalone.go +++ /dev/null @@ -1,50 +0,0 @@ -/* -Copyright 2023 Lawrence Livermore National Security, LLC - (c.f. AUTHORS, NOTICE.LLNS, COPYING) - -SPDX-License-Identifier: MIT -*/ - -package metrics - -import ( - "fmt" - - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" -) - -// A Standalone metric is typically going to provide its own logic for one or more replicated jobss -func (m *StandaloneMetricSet) ReplicatedJobs(spec *api.MetricSet) ([]jobset.ReplicatedJob, error) { - rjs := []jobset.ReplicatedJob{} - for _, metric := range m.Metrics() { - jobs, err := GetStandaloneReplicatedJobs(spec, metric, spec.Spec.Application.Volumes) - if err != nil { - return rjs, err - } - rjs = append(rjs, jobs...) - } - return rjs, nil -} - -// Create a standalone JobSet, one without volumes or application -// This will be definition be a JobSet for only one metric -func GetStandaloneReplicatedJobs( - spec *api.MetricSet, - metric *Metric, - volumes map[string]api.Volume, -) ([]jobset.ReplicatedJob, error) { - - m := (*metric) - - // Does the metric provide its own logic? 
- rjs, err := m.ReplicatedJobs(spec) - if err != nil { - return rjs, err - } - - if len(rjs) == 0 { - return rjs, fmt.Errorf("custom standalone metrics require a replicated job set") - } - return rjs, nil -} diff --git a/pkg/metrics/storage.go b/pkg/metrics/storage.go index dd1c9e3..2c2ff6a 100644 --- a/pkg/metrics/storage.go +++ b/pkg/metrics/storage.go @@ -8,46 +8,49 @@ SPDX-License-Identifier: MIT package metrics import ( - api "github.com/converged-computing/metrics-operator/api/v1alpha1" - jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" ) -// Get ReplicatedJobs intended to run storage -// For this setup, we expect to create a container for each storage metric -// And then add the volume bind to it -func (m *StorageMetricSet) ReplicatedJobs(spec *api.MetricSet) ([]jobset.ReplicatedJob, error) { +// These are common templates for storage apps. +// They define the interface of a Metric. - // Prepare replicated jobs list to return - rjs := []jobset.ReplicatedJob{} - - // Storage metrics do not need to share the process namespace - // The jobname empty string will use the default, no custom replicated job name, and sole tenancy false - job, err := GetReplicatedJob(spec, false, spec.Spec.Pods, spec.Spec.Completions, "", m.HasSoleTenancy()) - if err != nil { - return rjs, err - } +type StorageGeneric struct { + BaseMetric +} - // Only add storage volume if we have it! Not all storage interfaces require - // A Kubernetes abstraction, some are created via a command. - volumes := map[string]api.Volume{} - if spec.HasStorageVolume() { - // Add volumes expecting an application. 
- // A storage app is required to have a volume - volumes = map[string]api.Volume{"storage": spec.Spec.Storage.Volume} - } +// Family returns the storage family +func (m StorageGeneric) Family() string { + return StorageFamily +} - // Derive running scripts from the metric - runnerScripts := GetMetricsKeyToPath(m.Metrics()) - job.Template.Spec.Template.Spec.Volumes = GetVolumes(spec, runnerScripts, volumes) +// By default assume storage does not have sole tenancy +func (m StorageGeneric) HasSoleTenancy() bool { + return false +} - // Derive the containers, one per metric - // This will also include mounts for volumes - containers, err := getContainers(spec, m.Metrics(), volumes) - if err != nil { - return rjs, err +// StorageContainerSpec gets the storage container spec +// This is identical to the application spec and could be combined +func (m *StorageGeneric) StorageContainerSpec( + preBlock string, + command string, + postBlock string, +) []*specs.ContainerSpec { + + entrypoint := specs.EntrypointScript{ + Name: specs.DeriveScriptKey(DefaultEntrypointScript), + Path: DefaultEntrypointScript, + Pre: preBlock, + Command: command, + Post: postBlock, } - job.Template.Spec.Template.Spec.Containers = containers - rjs = append(rjs, *job) - return rjs, nil + return []*specs.ContainerSpec{{ + JobName: ReplicatedJobName, + Image: m.Image(), + Name: "storage", + WorkingDir: m.Workdir, + EntrypointScript: entrypoint, + Resources: m.ResourceSpec, + Attributes: m.AttributeSpec, + }} } diff --git a/pkg/metrics/volumes.go b/pkg/metrics/volumes.go index f38af65..bda3f73 100644 --- a/pkg/metrics/volumes.go +++ b/pkg/metrics/volumes.go @@ -8,17 +8,24 @@ SPDX-License-Identifier: MIT package metrics import ( - "fmt" + "path/filepath" - api "github.com/converged-computing/metrics-operator/api/v1alpha1" + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + "github.com/converged-computing/metrics-operator/pkg/specs" corev1 "k8s.io/api/core/v1" ) +var ( + 
makeExecutable = int32(0777) +) + // GetVolumeMounts returns read only volume for entrypoint scripts, etc. func getVolumeMounts( set *api.MetricSet, - volumes map[string]api.Volume, + volumes []specs.VolumeSpec, ) []corev1.VolumeMount { + + // This is for the core entrypoints (that are generated as config maps here) mounts := []corev1.VolumeMount{ { Name: set.Name, @@ -27,28 +34,34 @@ func getVolumeMounts( }, } - // Add on application volumes/claims - for volumeName, volume := range volumes { - mount := corev1.VolumeMount{ - Name: volumeName, - MountPath: volume.Path, - ReadOnly: volume.ReadOnly, + // This is for any extra or special entrypoints + for _, vs := range volumes { + + // Is this volume indicated for mount? + if vs.Mount { + mount := corev1.VolumeMount{ + Name: vs.Volume.Name, + MountPath: vs.Path, + ReadOnly: vs.ReadOnly, + } + mounts = append(mounts, mount) } - mounts = append(mounts, mount) } return mounts } // Get MetricsKeyToPath assumes we have a predictible listing of metrics // scripts. 
This is applicable for storage and application metrics -func GetMetricsKeyToPath(metrics []*Metric) []corev1.KeyToPath { +func generateOperatorItems(containerSpecs []*specs.ContainerSpec) []corev1.KeyToPath { // Each metric has an entrypoint script runnerScripts := []corev1.KeyToPath{} - for i, _ := range metrics { - key := fmt.Sprintf("entrypoint-%d", i) + for _, cs := range containerSpecs { + + // This is relative to the directory + path := filepath.Base(cs.EntrypointScript.Path) runnerScript := corev1.KeyToPath{ - Key: key, - Path: key + ".sh", + Key: cs.EntrypointScript.Name, + Path: path, Mode: &makeExecutable, } runnerScripts = append(runnerScripts, runnerScript) @@ -56,62 +69,44 @@ func GetMetricsKeyToPath(metrics []*Metric) []corev1.KeyToPath { return runnerScripts } -// getVolumes adds expected entrypoints along with added volumes (storage or applications) -// This function is intended for a set with a listing of metrics -func GetVolumes( - set *api.MetricSet, - runnerScripts []corev1.KeyToPath, - addedVolumes map[string]api.Volume, -) []corev1.Volume { +// Add extra config maps to the metrics_operator set from addons +// These are distinct because the operator needs to create them too +func getExtraConfigmaps(volumes []specs.VolumeSpec) []corev1.KeyToPath { - // TODO will need to add volumes to here for storage requests / metrics - volumes := []corev1.Volume{ - { - Name: set.Name, - VolumeSource: corev1.VolumeSource{ - ConfigMap: &corev1.ConfigMapVolumeSource{ + // Each metric has an entrypoint script + runnerScripts := []corev1.KeyToPath{} - // Namespace based on the cluster - LocalObjectReference: corev1.LocalObjectReference{ - Name: set.Name, - }, - Items: runnerScripts, - }, - }, - }, + for _, addedVolume := range volumes { + + // Check that the type is a config map + if addedVolume.Volume.ConfigMap == nil { + continue + } + // This will error if it's not a config map :) + if addedVolume.Volume.Name == "" { + for _, item := range 
addedVolume.Volume.ConfigMap.Items { + runnerScripts = append(runnerScripts, item) + } + } } - existingVolumes := getExistingVolumes(addedVolumes) - volumes = append(volumes, existingVolumes...) - return volumes + return runnerScripts } -// GetStandaloneVolumes is intended for a single metric, where the volumes -// are provided as custom EntrypointScripts -func GetStandaloneVolumes( +// getVolumes adds expected entrypoints along with added volumes (storage or applications) +// This function is intended for a set with a listing of metrics +func getReplicatedJobVolumes( set *api.MetricSet, - scripts []EntrypointScript, - addedVolumes map[string]api.Volume, + cs []*specs.ContainerSpec, + addedVolumes []specs.VolumeSpec, ) []corev1.Volume { - // Runner start scripts - makeExecutable := int32(0777) + // These are for the main entrypoints in /metrics_operator + runnerScripts := generateOperatorItems(cs) - // Each metric has an entrypoint script - runnerScripts := []corev1.KeyToPath{} - for i, script := range scripts { - key := script.Name - if key == "" { - key = fmt.Sprintf("entrypoint-%d", i) - } - runnerScript := corev1.KeyToPath{ - Key: key, - Path: key + ".sh", - Mode: &makeExecutable, - } - runnerScripts = append(runnerScripts, runnerScript) - } + // Any volumes that don't have a Name in added we need to generate under the operator + extraCMs := getExtraConfigmaps(addedVolumes) + runnerScripts = append(runnerScripts, extraCMs...) - // TODO will need to add volumes to here for storage requests / metrics volumes := []corev1.Volume{ { Name: set.Name, @@ -127,90 +122,22 @@ func GetStandaloneVolumes( }, }, } - - existingVolumes := getExistingVolumes(addedVolumes) + existingVolumes := getAddonVolumes(addedVolumes) volumes = append(volumes, existingVolumes...) return volumes } -// Get Existing volumes for the cluster. 
This can include: -// config maps -// secrets -// persistent volumes / claims -func getExistingVolumes(existing map[string]api.Volume) []corev1.Volume { +// Get Addon Volumes for the cluster. This can include: +func getAddonVolumes(vs []specs.VolumeSpec) []corev1.Volume { volumes := []corev1.Volume{} - for volumeName, volumeMeta := range existing { - - var newVolume corev1.Volume - - // Empty vol is typically used internally for an application share - if volumeMeta.EmptyVol { - newVolume = corev1.Volume{ - Name: volumeName, - VolumeSource: corev1.VolumeSource{ - EmptyDir: &corev1.EmptyDirVolumeSource{}, - }, - } - - } else if volumeMeta.HostPath != "" { - - newVolume = corev1.Volume{ - Name: volumeName, - VolumeSource: corev1.VolumeSource{ - HostPath: &corev1.HostPathVolumeSource{ - Path: volumeMeta.HostPath, - }, - }, - } - - } else if volumeMeta.SecretName != "" { - newVolume = corev1.Volume{ - Name: volumeName, - VolumeSource: corev1.VolumeSource{ - Secret: &corev1.SecretVolumeSource{ - SecretName: volumeMeta.SecretName, - }, - }, - } - - } else if volumeMeta.ConfigMapName != "" { - - // Prepare items as key to path - items := []corev1.KeyToPath{} - for key, path := range volumeMeta.Items { - newItem := corev1.KeyToPath{ - Key: key, - Path: path, - } - items = append(items, newItem) - } - - // This is a config map volume with items - newVolume = corev1.Volume{ - Name: volumeName, - VolumeSource: corev1.VolumeSource{ - ConfigMap: &corev1.ConfigMapVolumeSource{ - LocalObjectReference: corev1.LocalObjectReference{ - Name: volumeMeta.ConfigMapName, - }, - Items: items, - }, - }, - } - - } else { - - // Fall back to persistent volume claim - newVolume = corev1.Volume{ - Name: volumeName, - VolumeSource: corev1.VolumeSource{ - PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ - ClaimName: volumeMeta.ClaimName, - }, - }, - } + for _, volume := range vs { + // If the volume doesn't have a name, it was added to the metrics_operator namespace + if 
volume.Volume.Name == "" { + continue } - volumes = append(volumes, newVolume) + logger.Infof("Adding volume %s\n", &volume.Volume) + volumes = append(volumes, volume.Volume) } + logger.Infof("Volumes %v\n", volumes) return volumes } diff --git a/pkg/specs/specs.go b/pkg/specs/specs.go new file mode 100644 index 0000000..d2e3d89 --- /dev/null +++ b/pkg/specs/specs.go @@ -0,0 +1,79 @@ +/* +Copyright 2022-2023 Lawrence Livermore National Security, LLC + (c.f. AUTHORS, NOTICE.LLNS, COPYING) + +SPDX-License-Identifier: MIT +*/ + +package specs + +import ( + "fmt" + "path/filepath" + "strings" + + api "github.com/converged-computing/metrics-operator/api/v1alpha2" + corev1 "k8s.io/api/core/v1" +) + +// Specs are used to generate configurations for containers and volumes of +// a jobset before we finalize their creation + +// A ContainerSpec is used by a metric to define a container +// Job name and container name allow us to associate a script with a replicated job +type ContainerSpec struct { + JobName string + Image string + Name string + WorkingDir string + EntrypointScript EntrypointScript + + // If a command is provided, it's likely an addon (and EntrypointScript is ignored) + Command []string + + // Does the Container spec need to be written to our set of config maps? + NeedsWrite bool + + Resources *api.ContainerResources + Attributes *api.ContainerSpec +} + +// VolumeSpec wraps a volume with its mount path, read-only flag, and whether to mount it +type VolumeSpec struct { + Volume corev1.Volume + ReadOnly bool + Path string + Mount bool +} + +// Named entrypoint script for a container +type EntrypointScript struct { + Name string + Path string + Script string + + // Pre block typically includes metadata + Pre string + + // This is the main command, provided in case we need to wrap it in something + Command string + + // Anything after the command! 
+ Post string +} + +// WriteScript writes the final script, combining the pre, command, and post +func (e EntrypointScript) WriteScript() string { + return fmt.Sprintf("%s\n%s\n%s\n", e.Pre, e.Command, e.Post) + +} + +// Given a full path, derive the key from the script name minus the extension +func DeriveScriptKey(path string) string { + + // Basename + path = filepath.Base(path) + + // Remove the extension, and this assumes we don't have double . + return strings.Split(path, ".")[0] +} diff --git a/script/test.sh b/script/test.sh old mode 100755 new mode 100644 diff --git a/sdk/python/v1alpha1/.gitignore b/sdk/python/v1alpha2/.gitignore similarity index 100% rename from sdk/python/v1alpha1/.gitignore rename to sdk/python/v1alpha2/.gitignore diff --git a/sdk/python/v1alpha1/CHANGELOG.md b/sdk/python/v1alpha2/CHANGELOG.md similarity index 100% rename from sdk/python/v1alpha1/CHANGELOG.md rename to sdk/python/v1alpha2/CHANGELOG.md diff --git a/sdk/python/v1alpha1/MANIFEST.in b/sdk/python/v1alpha2/MANIFEST.in similarity index 100% rename from sdk/python/v1alpha1/MANIFEST.in rename to sdk/python/v1alpha2/MANIFEST.in diff --git a/sdk/python/v1alpha1/README.md b/sdk/python/v1alpha2/README.md similarity index 100% rename from sdk/python/v1alpha1/README.md rename to sdk/python/v1alpha2/README.md diff --git a/sdk/python/v1alpha1/metricsoperator/__init__.py b/sdk/python/v1alpha2/metricsoperator/__init__.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/__init__.py rename to sdk/python/v1alpha2/metricsoperator/__init__.py diff --git a/sdk/python/v1alpha1/metricsoperator/client.py b/sdk/python/v1alpha2/metricsoperator/client.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/client.py rename to sdk/python/v1alpha2/metricsoperator/client.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/__init__.py b/sdk/python/v1alpha2/metricsoperator/metrics/__init__.py similarity index 100% rename from 
sdk/python/v1alpha1/metricsoperator/metrics/__init__.py rename to sdk/python/v1alpha2/metricsoperator/metrics/__init__.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/app/__init__.py b/sdk/python/v1alpha2/metricsoperator/metrics/app/__init__.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/app/__init__.py rename to sdk/python/v1alpha2/metricsoperator/metrics/app/__init__.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/app/amg.py b/sdk/python/v1alpha2/metricsoperator/metrics/app/amg.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/app/amg.py rename to sdk/python/v1alpha2/metricsoperator/metrics/app/amg.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/app/lammps.py b/sdk/python/v1alpha2/metricsoperator/metrics/app/lammps.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/app/lammps.py rename to sdk/python/v1alpha2/metricsoperator/metrics/app/lammps.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/base.py b/sdk/python/v1alpha2/metricsoperator/metrics/base.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/base.py rename to sdk/python/v1alpha2/metricsoperator/metrics/base.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/network/__init__.py b/sdk/python/v1alpha2/metricsoperator/metrics/network/__init__.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/network/__init__.py rename to sdk/python/v1alpha2/metricsoperator/metrics/network/__init__.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/network/netmark.py b/sdk/python/v1alpha2/metricsoperator/metrics/network/netmark.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/network/netmark.py rename to sdk/python/v1alpha2/metricsoperator/metrics/network/netmark.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/network/osu_benchmark.py 
b/sdk/python/v1alpha2/metricsoperator/metrics/network/osu_benchmark.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/network/osu_benchmark.py rename to sdk/python/v1alpha2/metricsoperator/metrics/network/osu_benchmark.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/perf.py b/sdk/python/v1alpha2/metricsoperator/metrics/perf.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/perf.py rename to sdk/python/v1alpha2/metricsoperator/metrics/perf.py diff --git a/sdk/python/v1alpha1/metricsoperator/metrics/storage.py b/sdk/python/v1alpha2/metricsoperator/metrics/storage.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/metrics/storage.py rename to sdk/python/v1alpha2/metricsoperator/metrics/storage.py diff --git a/sdk/python/v1alpha1/metricsoperator/utils.py b/sdk/python/v1alpha2/metricsoperator/utils.py similarity index 100% rename from sdk/python/v1alpha1/metricsoperator/utils.py rename to sdk/python/v1alpha2/metricsoperator/utils.py diff --git a/sdk/python/v1alpha1/pyproject.toml b/sdk/python/v1alpha2/pyproject.toml similarity index 100% rename from sdk/python/v1alpha1/pyproject.toml rename to sdk/python/v1alpha2/pyproject.toml diff --git a/sdk/python/v1alpha1/setup.py b/sdk/python/v1alpha2/setup.py similarity index 97% rename from sdk/python/v1alpha1/setup.py rename to sdk/python/v1alpha2/setup.py index f976abf..a2ccf37 100644 --- a/sdk/python/v1alpha1/setup.py +++ b/sdk/python/v1alpha2/setup.py @@ -30,14 +30,14 @@ if __name__ == "__main__": setup( name="metricsoperator", - version="0.0.21", + version="0.1.0", author="Vanessasaurus", author_email="vsoch@users.noreply.github.com", maintainer="Vanessasaurus", packages=find_packages(), include_package_data=True, zip_safe=False, - url="https://github.com/converged-computing/metrics-operator/tree/main/python-sdk/v1alpha1", + url="https://github.com/converged-computing/metrics-operator/tree/main/python-sdk/v1alpha2", 
license="MIT", description=DESCRIPTION, long_description=LONG_DESCRIPTION, diff --git a/setup.cfg b/setup.cfg index d60ba9d..a105d5a 100644 --- a/setup.cfg +++ b/setup.cfg @@ -4,5 +4,5 @@ max-line-length = 100 ignore = E1 E2 E5 W5 per-file-ignores = docs/conf.py:E501 - sdk/python/v1alpha1/metricsoperator/metrics/network/__init__.py:F401 - sdk/python/v1alpha1/metricsoperator/metrics/app/__init__.py:F401 + sdk/python/v1alpha2/metricsoperator/metrics/network/__init__.py:F401 + sdk/python/v1alpha2/metricsoperator/metrics/app/__init__.py:F401