
Update compute resources to account for MCAD and InstaScale #305

Merged · 4 commits · Oct 10, 2023
Conversation

@astefanutti (Contributor) commented Sep 25, 2023:

Following up on #216, this PR updates the operator compute resources so they better account for the requirements of both the MCAD and InstaScale controllers.

It configures the resources so that the operator is assigned the Guaranteed QoS class and has enough resources to perform acceptably by default.

Closes #280.
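For reference, the Guaranteed QoS class requires requests and limits to be equal for every container. A minimal sketch of such a resources block, with illustrative values in line with the 1 CPU / 1Gi figures discussed in the review below (not quoted from the actual diff):

```yaml
# Sketch only: equal requests and limits give the pod the Guaranteed QoS class.
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "1"
    memory: 1Gi
```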

Diff excerpt under review (with the review comment on the updated `cpu: "1"` line):

    requests:
      cpu: 10m
      memory: 64Mi
Contributor:

1 CPU is quite high for two controllers; is it really needed?

@astefanutti (Contributor, Author):

Yes. I've been reluctant, given our current CI environment is short on CPU, but MCAD does not have a reputation for being lightweight, and its tests assume 2 CPUs. That may explain why no CPU requirements were specified for MCAD in the previous operator design.

I also think it may not be good practice to let the limitations of our current CI environment drive these requirements, so they may need to be configurable depending on the environment.

Contributor:

I'm wondering whether it would make sense to reduce the request value. That would help with resource usage in non-intensive cases while keeping the limit high enough; on the other hand, it can affect the pod eviction order.
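For context, reducing the request while keeping the limit corresponds to the Burstable QoS class rather than Guaranteed. A hypothetical sketch of that alternative (values are illustrative, not taken from this PR):

```yaml
# Sketch only: a request below the limit keeps the scheduling footprint small,
# but a Burstable pod is evicted before Guaranteed pods under node pressure.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: "1"
    memory: 1Gi
```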

@astefanutti (Contributor, Author):

I'd be inclined to have MCAD configured with the Guaranteed QoS class and with enough resources so that it performs acceptably by default.

With that in mind, I've added the extra test configuration so it can still run within the limited resources of GH Actions runners.
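For completeness, the QoS class actually assigned to the operator pod can be checked with `kubectl get pod <operator-pod> -n <namespace> -o jsonpath='{.status.qosClass}'`, which prints `Guaranteed` when requests equal limits (pod name and namespace are placeholders).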

@astefanutti (Contributor, Author):
/assign @anishasthana @sutaakar @KPostOffice

@sutaakar (Contributor) left a review comment:

/lgtm

The openshift-ci bot added the lgtm label on Sep 29, 2023.
@jbusche (Collaborator) commented Oct 4, 2023:

@astefanutti, it's interesting... I first tried control runs on a default codeflare-operator:v1.0.0-rc.1 installation and then modified the CSV to set 1 CPU and increase the memory limit to 1Gi. I didn't notice a change in performance. Is it possible that I need to change it somewhere else? Perhaps I'm confusing the subscription.yaml with the manager.yaml... I'm not sure. But changing the values didn't seem to change the performance results. In my runs I never exceeded 95MiB memory and 154m CPU.

| AppWrappers | Total Time (seconds) | MCAD Image | Cluster Info | Comments |
| --- | --- | --- | --- | --- |
| 5 | 36 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 20 | 92 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 50 | 306 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 100 | 591 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 100 | 558 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | |
| 200 | 1261 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | Peak of 89MiB memory and 150m cpu observed |
| 5 | 36 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 20 | 123 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 50 | 315 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 100 | 608 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 200 | 1262 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request |
| 300 | 1790 | codeflare-operator:v1.0.0-rc.1 | Fyre FIPS MEDIUM OC 412, 3 nodes, 8 cpu/16GB mem | CSV adjusted to 1Gi memory, 1 CPU limit and 1 CPU request; peak of about 95MiB memory and 154m cpu observed |
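(For reference, peak usage figures like these can be observed with `oc adm top pod` in the operator namespace, assuming cluster metrics are available; the exact tooling used for these measurements isn't stated in the comment.)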

@sutaakar (Contributor) commented Oct 4, 2023:

I guess MCAD is not limited by resources but rather by something internal.

@astefanutti (Contributor, Author) commented:

> I first tried control runs on a default codeflare-operator:v1.0.0-rc.1 installation and then modified the csv to adjust to 1 cpu and increasing the memory limit to 1Gi. [...] But changing the values didn't seem to change performance results.

@jbusche the resource requests/limits must be set in the Subscription, not the CSV, otherwise they'll get overwritten. One way to make sure is to modify the Subscription and check that the Deployment eventually gets updated accordingly.
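For illustration, overriding the operator resources through the OLM Subscription's `spec.config.resources` stanza might look like the sketch below. The channel, catalog source, and namespaces are placeholders, not values taken from this PR:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: codeflare-operator
  namespace: openshift-operators     # placeholder namespace
spec:
  channel: stable                    # placeholder channel
  name: codeflare-operator
  source: community-operators        # placeholder catalog source
  sourceNamespace: openshift-marketplace
  config:
    resources:                       # OLM propagates this to the operator Deployment
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```

Since OLM reconciles the operator Deployment from the Subscription, an edit here should eventually show up on the Deployment, which is the verification suggested above.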

@dimakis (Contributor) left a review comment:

lgtm

@dimakis (Contributor) commented Oct 10, 2023:

/approve

1 similar comment
@sutaakar (Contributor):
/approve

openshift-ci bot commented Oct 10, 2023:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dimakis, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The openshift-ci bot merged commit 2ed79fb into project-codeflare:main on Oct 10, 2023.
6 checks passed
Successfully merging this pull request may close these issues.

Include MCAD and InstaScale compute resources requirements
6 participants