
Update compute resources to account for MCAD and InstaScale #305

Merged · 4 commits · Oct 10, 2023
Conversation

@astefanutti (Contributor) commented Sep 25, 2023:

Following up on #216, this PR updates the operator compute resources so they better account for the requirements of both the MCAD and InstaScale controllers.

It configures the resources so that the operator is assigned the Guaranteed QoS class and has enough resources to perform acceptably by default.

Closes #280.
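For reference, the Guaranteed QoS class requires requests and limits to be equal for every container. A minimal sketch of such a resources block, with illustrative values in line with the 1 CPU / 1Gi figures discussed in the review below (not quoted from the actual diff):

```yaml
# Sketch only: equal requests and limits give the pod the Guaranteed QoS class.
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "1"
    memory: 1Gi
```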

Diff excerpt under review (with the review comment on the updated `cpu: "1"` line):

    requests:
      cpu: 10m
      memory: 64Mi
Contributor:

1 CPU is quite high for two controllers; is it really needed?

@astefanutti (Contributor, Author):

Yes. I've been reluctant, given our current CI environment is short on CPU, but MCAD does not have a reputation for being lightweight, and its tests assume 2 CPUs. That may explain why no CPU requirements were specified for MCAD in the previous operator design.

I also think it may not be good practice to let the limitations of our current CI environment drive these requirements, so they may need to be configurable depending on the environment.

Contributor:

I'm wondering whether it would make sense to reduce the request value. That would help with resource usage in non-intensive cases while keeping the limit high enough; on the other hand, it can affect the pod eviction order.
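For context, reducing the request while keeping the limit corresponds to the Burstable QoS class rather than Guaranteed. A hypothetical sketch of that alternative (values are illustrative, not taken from this PR):

```yaml
# Sketch only: a request below the limit keeps the scheduling footprint small,
# but a Burstable pod is evicted before Guaranteed pods under node pressure.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: "1"
    memory: 1Gi
```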

@astefanutti (Contributor, Author):

I'd be inclined to have MCAD configured with the Guaranteed QoS class and with enough resources so that it performs acceptably by default.

With that in mind, I've added the extra test configuration so it can still run within the limited resources of GH Actions runners.
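For completeness, the QoS class actually assigned to the operator pod can be checked with `kubectl get pod <operator-pod> -n <namespace> -o jsonpath='{.status.qosClass}'`, which prints `Guaranteed` when requests equal limits (pod name and namespace are placeholders).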

@astefanutti (Contributor, Author):
/assign @anishasthana @sutaakar @KPostOffice

@sutaakar (Contributor) left a review comment:

/lgtm

The openshift-ci bot added the lgtm label on Sep 29, 2023.
@jbusche (Collaborator) commented Oct 4, 2023:

@astefanutti, it's interesting... I first tried control runs on a default codeflare-operator:v1.0.0-rc.1 installation and then modified the CSV to set 1 CPU and increase the memory limit to 1Gi. I didn't notice a change in performance. Is it possible that I need to change it somewhere else? Perhaps I'm confusing the subscription.yaml with the manager.yaml... I'm not sure. But changing the values didn't seem to change the performance results. In my runs I never exceeded 95MiB memory and 154m CPU.

| AppWrappers | Total Time (seconds) | MCAD Image | Cluster Info | Comments |
| --- | --- | --- | --- | --- |
| 5 | 36 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 20 | 92 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 50 | 306 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 100 | 591 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 100 | 558 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 200 | 1261 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | Peak of 89MiB memory and 150m cpu observed |
| 5 | 36 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 20 | 123 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 50 | 315 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 100 | 608 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 200 | 1262 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 300 | 1790 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request; peak of about 95MiB memory and 154m cpu observed |
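(For reference, peak usage figures like these can be observed with `oc adm top pod` in the operator namespace, assuming cluster metrics are available; the exact tooling used for these measurements isn't stated in the comment.)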

@sutaakar (Contributor) commented Oct 4, 2023:

I guess MCAD is not limited by resources but rather by something internal.

@astefanutti (Contributor, Author) commented:

> I first tried control runs on a default codeflare-operator:v1.0.0-rc.1 installation and then modified the csv to adjust to 1 cpu and increasing the memory limit to 1Gi. [...] But changing the values didn't seem to change performance results.

@jbusche the resource requests/limits must be set in the Subscription, not the CSV, otherwise they'll get overwritten. One way to make sure is to modify the Subscription and check that the Deployment eventually gets updated accordingly.
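For illustration, overriding the operator resources through the OLM Subscription's `spec.config.resources` stanza might look like the sketch below. The channel, catalog source, and namespaces are placeholders, not values taken from this PR:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: codeflare-operator
  namespace: openshift-operators     # placeholder namespace
spec:
  channel: stable                    # placeholder channel
  name: codeflare-operator
  source: community-operators        # placeholder catalog source
  sourceNamespace: openshift-marketplace
  config:
    resources:                       # OLM propagates this to the operator Deployment
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```

Since OLM reconciles the operator Deployment from the Subscription, an edit here should eventually show up on the Deployment, which is the verification suggested above.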

@dimakis (Contributor) left a review comment:

lgtm

@dimakis (Contributor) commented Oct 10, 2023:

/approve

1 similar comment
@sutaakar (Contributor):
/approve

openshift-ci bot commented Oct 10, 2023:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dimakis, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The openshift-ci bot merged commit 2ed79fb into project-codeflare:main on Oct 10, 2023.
6 checks passed
Successfully merging this pull request may close these issues.

Include MCAD and InstaScale compute resources requirements
6 participants