add model-server/estimator to KeplerInternal #322

sunya-ch · 2023-12-07T08:30:35Z

This PR replaces #235 by moving the integration to the KeplerInternal API.

Change summary:

Add Estimator and ModelServer in KeplerInternalSpec and KeplerInternalStatus
Add components/estimator
Implement components/model-server when enabled
Add Model Server Reconcilers to kepler-internal Reconciler
Modify components/exporter to add estimator sidecar if set
Add access Role for deployments and persistentvolumeclaims
Add AddIfNotEmpty and VolumeFromEmptyDir utility functions (used for Estimator and ModelServer creation)

Bug fixes (not related to model server):

Correct reference for UP-TO-DATE status
Replace ki.namespace in updateStatus with the deployment namespace from spec.

KeplerInternal

Here is the CR that I used for running in my local cluster.

apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/instance: kepler
    app.kubernetes.io/part-of: kepler-operator
  name: kepler
spec:
  exporter:
    deployment:
      image: quay.io/sustainable_computing_io/kepler:release-0.6.1-libbpf
      namespace: kepler-operator
  openshift:
    enabled: true
    dashboard:
      enabled: true
  modelServer:
    enabled: true
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
      total:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip

KeplerInternal Status

With neither estimator nor modelserver

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           83s   <abbr>  NotInstalled   NotInstalled

With only estimator

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           40m   <abbr>  Running        NotInstalled

With estimator+modelserver

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           15s   <abbr>  Running        Running

Signed-off-by: Sunyanan Choochotkaew [email protected]

pkg/api/v1alpha1/kepler_internal_types.go

sthaha · 2023-12-07T14:25:18Z

pkg/components/exporter/exporter.go

+		if ms.Enabled {
+			exporterConfigMap["MODEL_SERVER_ENABLE"] = "true"
+		}
+		modelServerConfig := modelserver.ModelServerConfigForClient(k.ModelServerDeploymentName(), k.Spec.ModelServer)


Suggested change

modelServerConfig := modelserver.ModelServerConfigForClient(k.ModelServerDeploymentName(), k.Spec.ModelServer)

modelServerConfig := modelserver.ConfigForClient(k.ModelServerDeploymentName(), k.Spec.ModelServer)

How about renaming modelserver.ModelServer<X> to modelserver.<X> to avoid stuttering?

I wonder if k.Spec.ModelServer.ClientConfig() would be better, since all the function needs in order to compute the client config is the ModelServer spec.

Yes, it is the config for clients of the model server (i.e., the exporter and estimator containers).

@sthaha Sorry, I misunderstood your comment.
I have just changed the method name: https://github.com/sustainable-computing-io/kepler-operator/compare/5fa1b54bc0dc3b2f68894af0cc89b7341f8a80f1..6209dadf4f6f04ed3e786989f062f71459dd06f2

Note: I cannot provide k.Spec.ModelServer.ClientConfig().
A method with the receiver (ms *v1alpha1.InternalModelServerSpec) has to be defined inside the v1alpha1 package, so ConfigForClient cannot be such a method while it lives in modelserver.
I also cannot move the method into v1alpha1, because it needs the k8s package and the k8s package imports v1alpha1 (an import cycle).

@sthaha Could you please confirm whether this change resolves your comment?
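
The constraint described in this thread can be illustrated with a minimal, single-file sketch. It is not the code from this PR: InternalModelServerSpec below is a stand-in with hypothetical fields (Enabled, URL), and the MODEL_SERVER_URL key and the default "<deployment>-svc:8100" address are assumptions made for illustration. The point is only that a plain function taking the spec as an argument can live in the modelserver package, so v1alpha1 never has to import it.

```go
package main

import "fmt"

// InternalModelServerSpec stands in for the API type defined in v1alpha1.
// The fields are hypothetical; the real spec may look different.
type InternalModelServerSpec struct {
	Enabled bool
	URL     string // optional address of an already running model server
}

// ConfigForClient is a free function rather than a method on the spec type.
// In a real layout it would sit in the modelserver package and import
// v1alpha1, which keeps the dependency pointing in one direction only.
func ConfigForClient(deploymentName string, spec InternalModelServerSpec) map[string]string {
	cfg := map[string]string{}
	switch {
	case spec.URL != "":
		// An explicit URL wins, e.g. for an external model server.
		cfg["MODEL_SERVER_URL"] = spec.URL
	case spec.Enabled:
		// Assumed default: the service created next to the deployment.
		cfg["MODEL_SERVER_URL"] = fmt.Sprintf("http://%s-svc:%d", deploymentName, 8100)
	}
	return cfg
}

func main() {
	fmt.Println(ConfigForClient("kepler-model-server", InternalModelServerSpec{Enabled: true}))
	fmt.Println(ConfigForClient("kepler-model-server", InternalModelServerSpec{URL: "http://models.example.com:8100"}))
}
```

The call site then reads modelserver.ConfigForClient(name, spec), which also avoids the stutter raised in the review.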

sthaha · 2023-12-07T14:35:33Z

@sunya-ch Does this support deploying multiple model servers, in the same way that it currently allows multiple kepler-internals?

It may be worth considering whether it makes sense to have a separate CRD for KeplerModelServer. This would allow multiple model servers to be deployed (if that is even a use case) with different configs.
kepler-internal could then have a spec.modelserver.ref that refers to the KeplerModelServer.

sunya-ch · 2023-12-07T14:47:12Z

Yes. I generate the model server name in the same way as the Kepler exporter name (the CR name plus a model-server suffix). I think we can keep them together for simplicity of deployment. Each exporter can connect to a different model server. The model server is created only when it is enabled, but Kepler can also be given just a model server URL in order to connect to another model server (including an external one).

sthaha · 2023-12-08T02:55:17Z

@sunya-ch Overall the code looks good to me; the one missing part is the e2e test.
Could you please also add a simple e2e test that validates the deployment of the model server?

pkg/controllers/kepler_internal.go

sunya-ch · 2023-12-08T03:48:59Z

I cannot find an e2e test for the KeplerInternal CR. I think adding one for keplerinternal would make this PR too big. Could that be done in another PR, so that I can then help add the model server part?
We could simply check .status.modelServer.status and .status.estimator.status of the KeplerInternal.

See the issue open here: #314

sthaha · 2023-12-08T04:26:51Z

@sunya-ch

I cannot find an e2e test for the KeplerInternal CR. I think adding one for keplerinternal would make this PR too big. Could that be done in another PR, so that I can then help add the model server part?

The current Kepler tests cover almost all the use cases currently supported by kepler-internal, so a set of tests that replicates what the creation of kepler already does would give us only a low ROI.

IMHO, all features should be accompanied by tests that validate the most common use cases. It shouldn't be too hard to add an e2e test by making a copy of the existing kepler-e2e.

sthaha · 2023-12-08T04:27:54Z

@sunya-ch,

  annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf

That isn't required for kepler-internal. It is only a hack enabled for kepler so that stable API users are able to deploy the libbpf image, which is kernel agnostic.

sunya-ch · 2023-12-08T04:50:29Z

I see. I had it because I first tried to create the CR by specifying the image name, but it turned out that this is not allowed for keplerinternal. I just didn't remove it ;)

sunya-ch · 2023-12-08T04:51:37Z

@sunya-ch

I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part?

The current Kepler tests cover almost all the usecase currently supported by kepler-internal, so having a set of tests that replicate what creation of kepler already does gives us only a low ROI.

IMHO, all features should be be accompanied by tests that validate most common usecases. It shouldn't be too hard to add an e2e by making a copy of the existing kepler-e2e.

Yes, it might not be hard, but I still think the KeplerInternal e2e test should go into a separate PR so that it is easier to track. I can then rebase this PR on top of that PR.

sthaha · 2023-12-08T08:05:16Z

@sunya-ch Hopefully #325 should help with the model-server testing 👼

sunya-ch · 2023-12-08T09:36:56Z

Converted to draft; this PR now builds on top of PR #325.
I will rebase onto the v1alpha1 branch once that PR has merged.

sunya-ch · 2023-12-08T11:53:17Z

@sthaha Sorry for the multiple force-pushes.
Most of the main code is unchanged; the failures came from my own mistakes in the test case.

The additional changes are a new e2e test case and an Enabled() function for InternalEstimatorSpec.

kepler-operator/pkg/api/v1alpha1/kepler_internal_types.go

Line 108 in a4177f9

func (e InternalEstimatorSpec) Enabled() bool {
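
The body of Enabled() is not shown in this thread. As a rough sketch of what such a helper might do, assuming the spec mirrors the estimator.node.total and estimator.node.components blocks of the CR in the description (the Go type and field names below are hypothetical stand-ins), the estimator could count as enabled when any group asks for a sidecar:

```go
package main

import "fmt"

// EstimatorConfig mirrors one estimator block of the CR (sidecar, initUrl).
// Hypothetical stand-in for the real API type.
type EstimatorConfig struct {
	Sidecar bool
	InitURL string
}

// EstimatorGroup mirrors estimator.node with its total and components blocks.
type EstimatorGroup struct {
	Total      *EstimatorConfig
	Components *EstimatorConfig
}

// InternalEstimatorSpec is a stand-in for the v1alpha1 type of the same name.
type InternalEstimatorSpec struct {
	Node EstimatorGroup
}

// Enabled reports whether any configured group requests the sidecar.
// Unset (nil) groups are treated as disabled.
func (e InternalEstimatorSpec) Enabled() bool {
	for _, c := range []*EstimatorConfig{e.Node.Total, e.Node.Components} {
		if c != nil && c.Sidecar {
			return true
		}
	}
	return false
}

func main() {
	withSidecar := InternalEstimatorSpec{
		Node: EstimatorGroup{Total: &EstimatorConfig{Sidecar: true}},
	}
	fmt.Println(withSidecar.Enabled())             // true
	fmt.Println(InternalEstimatorSpec{}.Enabled()) // false
}
```

A value receiver keeps the call safe on a zero-value spec, which is convenient when the estimator block is omitted from the CR.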

tests/e2e/kepler_internal_test.go

pkg/api/v1alpha1/kepler_internal_types.go

pkg/components/estimator/estimator.go

sthaha · 2023-12-11T03:45:40Z

pkg/components/exporter/exporter.go

+		k8s.VolumeFromHost("lib-modules", "/lib/modules"),
+		k8s.VolumeFromHost("tracing", "/sys"),
+		k8s.VolumeFromHost("proc", "/proc"),
+		k8s.VolumeFromHost("kernel-src", "/usr/src/kernels"),
+		k8s.VolumeFromHost("kernel-debug", "/sys/kernel/debug"),


We need a better way of handling volumes, but that can be done in another PR.

E.g. each New<X>Container can return []NamedMount.

type NamedMount string

const (
	HostLibModulesMount  NamedMount = "host-lib-modules"
	HostProc             NamedMount = "host-proc"
	KeplerConfigMapMount NamedMount = "cm-kepler"
)

func (m NamedMount) Volume() corev1.Volume {
	// source for each mount: a host path for "host-" mounts, a ConfigMap for "cm-" mounts
	mounts := map[NamedMount]string{
		HostLibModulesMount: "/lib/modules",
		...
	}
	name := string(m)
	if strings.HasPrefix(name, "host-") {
		return k8s.VolumeFromHost(name, mounts[m])
	} else if strings.HasPrefix(name, "cm-") {
		return k8s.VolumeFromConfigMap(name, mounts[m])
	}
	return corev1.Volume{}
}
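
To make the NamedMount suggestion above easier to try out, here is a self-contained variant that runs without the Kubernetes modules. The Volume struct is a stand-in for corev1.Volume, the ConfigMap name "kepler-cfm" is made up, and VolumesFor is a hypothetical helper that merges the mounts returned by several New<X>Container functions without duplicating a volume.

```go
package main

import (
	"fmt"
	"strings"
)

// Volume is a simplified stand-in for corev1.Volume.
type Volume struct {
	Name   string
	Kind   string // "hostPath" or "configMap"
	Source string // host path or ConfigMap name
}

type NamedMount string

const (
	HostLibModulesMount  NamedMount = "host-lib-modules"
	HostProcMount        NamedMount = "host-proc"
	KeplerConfigMapMount NamedMount = "cm-kepler"
)

// mountSources maps each known mount to its host path or ConfigMap name.
var mountSources = map[NamedMount]string{
	HostLibModulesMount:  "/lib/modules",
	HostProcMount:        "/proc",
	KeplerConfigMapMount: "kepler-cfm", // hypothetical ConfigMap name
}

// Volume derives the volume kind from the name prefix.
func (m NamedMount) Volume() (Volume, error) {
	src, ok := mountSources[m]
	if !ok {
		return Volume{}, fmt.Errorf("unknown mount %q", m)
	}
	name := string(m)
	switch {
	case strings.HasPrefix(name, "host-"):
		return Volume{Name: name, Kind: "hostPath", Source: src}, nil
	case strings.HasPrefix(name, "cm-"):
		return Volume{Name: name, Kind: "configMap", Source: src}, nil
	}
	return Volume{}, fmt.Errorf("mount %q has no known prefix", m)
}

// VolumesFor merges the mounts of several containers, keeping first-seen
// order and emitting each volume only once.
func VolumesFor(containers ...[]NamedMount) ([]Volume, error) {
	seen := map[NamedMount]bool{}
	var volumes []Volume
	for _, mounts := range containers {
		for _, m := range mounts {
			if seen[m] {
				continue
			}
			v, err := m.Volume()
			if err != nil {
				return nil, err
			}
			seen[m] = true
			volumes = append(volumes, v)
		}
	}
	return volumes, nil
}

func main() {
	exporter := []NamedMount{HostLibModulesMount, HostProcMount, KeplerConfigMapMount}
	estimator := []NamedMount{KeplerConfigMapMount}
	volumes, err := VolumesFor(exporter, estimator)
	if err != nil {
		panic(err)
	}
	for _, v := range volumes {
		fmt.Printf("%s (%s) <- %s\n", v.Name, v.Kind, v.Source)
	}
}
```

Returning an error for an unknown mount, instead of an empty volume, is a design choice of this sketch; it surfaces a typo in a mount name at reconcile time rather than in a failing pod.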

pkg/components/modelserver/modelserver_test.go

pkg/reconciler/runner.go

pkg/utils/test/assertions.go

sthaha · 2023-12-11T03:58:12Z

pkg/utils/test/assertions.go

+
+func (f Framework) AssertInternalStatus(ki *v1alpha1.KeplerInternal) {
+	// the status will be updated
+	ki = f.WaitUntilInternalCondition(ki.Name, v1alpha1.Reconciled, v1alpha1.ConditionTrue)


Suggested change

ki = f.WaitUntilInternalCondition(ki.Name, v1alpha1.Reconciled, v1alpha1.ConditionTrue)

ki = f.WaitUntilInternalCondition(name, v1alpha1.Reconciled, v1alpha1.ConditionTrue)

sthaha · 2023-12-11T04:00:32Z

@vprashar2929 Could you please help validate this on OpenShift? Just ensuring that the supported use case of creating a kepler works is good enough.

vprashar2929 · 2023-12-11T17:40:05Z

A couple of observations from testing this on OpenShift (4.13):

When deploying KeplerInternals on OpenShift, there is still an issue with showing the appropriate status of Kepler:

oc get keplerinternals.kepler.system.sustainable.computing.io                                                                               
NAME           PORT   DESIRED   CURRENT   UP-TO-DATE   READY   AVAILABLE   AGE    IMAGE                                  ESTIMATOR   MODEL-SERVER
mykepler-101   9103   0         0                      0                   168m   quay.io/rh_ee_vprashar/kepler:latest

When the estimator is deployed along with the model server, it looks up the wrong model-server service name: mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local. The openshift-kepler-operator namespace doesn't exist.

set NODE_TOTAL_ESTIMATOR to true.
set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip.
clean socket
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca34a02430>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/acpi/AbsPower
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca33e17460>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/rapl/AbsPower

sunya-ch · 2023-12-12T00:23:46Z

@vprashar2929 Thank you. I did see that the namespace was hard-coded in the default model server URL. I have now updated it to use the namespace from the Kepler spec.

As for the status not being updated, could you share the describe output? I cannot find the cause of the issue, since I can see the updated status on my OpenShift cluster (4.12). From the result above, the issue seems to affect not only the model server but also the status of the daemonset.
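
For reference, the namespace fix mentioned above amounts to building the in-cluster address from the namespace in the spec rather than from a fixed string. The sketch below is an illustration only: the function name is hypothetical, and the "-model-server-svc" suffix, port 8100 and /model path are taken from the estimator log earlier in this thread, not from the operator code.

```go
package main

import "fmt"

// ModelServerURL builds the cluster-internal request URL of the model server
// that belongs to a KeplerInternal named crName, deployed in namespace.
// Hypothetical helper; the naming scheme is inferred from the log output.
func ModelServerURL(crName, namespace string, port int) string {
	host := fmt.Sprintf("%s-model-server-svc.%s.svc.cluster.local", crName, namespace)
	return fmt.Sprintf("http://%s:%d/model", host, port)
}

func main() {
	// "my-namespace" is a placeholder for spec.exporter.deployment.namespace.
	fmt.Println(ModelServerURL("mykepler-101", "my-namespace", 8100))
}
```

With a hard-coded namespace, the same call would always resolve against openshift-kepler-operator, which is the name resolution failure visible in the log.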

sthaha · 2023-12-12T00:58:26Z

pkg/utils/test/framework.go

+			if errors.IsNotFound(err) {
+				return true, fmt.Errorf("kepler-internal %s is not found", name)
+			}
+			statusOK := true


Suggested change

statusOK := true

var statusOK bool

vprashar2929 · 2023-12-12T09:13:43Z

Thank you. I did see that the namespace was hard-coded in the default model server URL. I have now updated it to use the namespace from the Kepler spec.

Should we use the namespace from the keplerinternal spec?

As for the status not being updated, could you share the describe output? I cannot find the cause of the issue, since I can see the updated status on my OpenShift cluster (4.12). From the result above, the issue seems to affect not only the model server but also the status of the daemonset.

When I created the keplerinternals, I used a namespace other than kepler-operator. AFAIK the controller only watches the kepler-operator namespace, so if you create an instance in a different namespace you won't be able to see the status. This is a known issue, and we are planning to fix it by adding a config-map (#312).

sthaha · 2023-12-13T01:48:33Z

@sunya-ch

I see the following difference between installing kepler using the kepler CRD from the v1alpha branch and from this PR. Ideally, optional kepler-internal features should not introduce any difference.

However, I am okay with this if you can confirm that it has no impact or that it fixes an issue.

configmap

Signed-off-by: Sunyanan Choochotkaew <[email protected]>

sunya-ch · 2023-12-13T02:01:28Z

Yes, it has no effect because the default is false; it is enabled only when it is set to true. (reference: https://github.com/sustainable-computing-io/kepler/blob/442bcfe5d3bf26a1285ff0f13d1f5017d28b9e37/pkg/model/model.go#L189)

sunya-ch · 2023-12-13T02:06:18Z

Thank you. I did see that the namespace was hard-coded in the default model server URL. I have now updated it to use the namespace from the Kepler spec.

Should we use the namespace from the keplerinternal spec?

As for the status not being updated, could you share the describe output? I cannot find the cause of the issue, since I can see the updated status on my OpenShift cluster (4.12). From the result above, the issue seems to affect not only the model server but also the status of the daemonset.

When I created the keplerinternals, I used a namespace other than kepler-operator. AFAIK the controller only watches the kepler-operator namespace, so if you create an instance in a different namespace you won't be able to see the status. This is a known issue, and we are planning to fix it by adding a config-map (#312).

I guess so; eventually we may allow Kepler to be deployed in any namespace for keplerinternal.
I put a TODO in this PR: https://github.com/sustainable-computing-io/kepler-operator/pull/324/files#diff-435eecb1d2af40ee96747f88ab71eb2758da989c92f76dcb1f714fc1b300c633

We need the Kepler operator to add the new namespace to its cache. For now, we have to rely on the additional namespace list given on the command line, so whoever deploys Kepler has to know in advance the list of namespaces in which keplerinternal is allowed to be installed.

kepler-operator/cmd/manager/main.go

Line 90 in 73505af

flag.CommandLine.Var(flag.Value(&additionalNamespaces), "watch-namespaces",
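
The line above registers a custom flag value. The type of additionalNamespaces is not shown here, so the following is only a sketch of how such a flag could be implemented with the standard flag.Value interface; the stringList type and the parseWatchNamespaces helper are hypothetical. It accepts comma-separated values and may be repeated.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringList collects namespaces from one or more flag occurrences.
// It satisfies flag.Value through its String and Set methods.
type stringList []string

func (s *stringList) String() string {
	if s == nil {
		return ""
	}
	return strings.Join(*s, ",")
}

// Set is called once per occurrence of the flag on the command line.
func (s *stringList) Set(value string) error {
	for _, ns := range strings.Split(value, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			*s = append(*s, ns)
		}
	}
	return nil
}

// parseWatchNamespaces parses args with a private FlagSet so that the sketch
// does not touch the process-wide flag.CommandLine.
func parseWatchNamespaces(args []string) ([]string, error) {
	var namespaces stringList
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	fs.Var(&namespaces, "watch-namespaces", "comma-separated list of additional namespaces to watch")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return namespaces, nil
}

func main() {
	ns, err := parseWatchNamespaces([]string{
		"--watch-namespaces=kepler-a,kepler-b",
		"--watch-namespaces", "kepler-c",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(ns)
}
```

Because the list is fixed at start-up, this matches the limitation described above: namespaces have to be known before the operator is deployed.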

sthaha

Looks great! Thanks a lot @sunya-ch 🤗

sunya-ch mentioned this pull request Dec 7, 2023

update model server support #235

Closed

sunya-ch requested a review from sthaha December 7, 2023 08:32

sunya-ch force-pushed the model-server-internal branch from 8ff6a09 to f444a30 Compare December 7, 2023 09:07