OCPBUGS-46426: e2e: add irdma to module_blacklist kernel args #1249

Merged (1 commit) on Dec 18, 2024

@rbaturov rbaturov commented Dec 10, 2024

This implements a workaround to prevent CI failures on specific hardware that uses an Intel E810 network card.
When UserLevelNetworking is set to True, tuned attempts to set the combined channel count equal to the number of reserved CPUs, but fails with the following error:
tuned.utils.commands: Executing 'ethtool -L ens2f0 combined 1' error: netlink error: Device or resource busy
The error occurs because the ice driver rejects the change and reports: ens2f0: Cannot change channels when RDMA is active.
This failure causes the tuned profile to become degraded.
As a temporary solution, adding 'module_blacklist=irdma' to the kernel args prevents the irdma module from loading, which blocks RDMA and avoids these errors.
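
As a rough illustration of the workaround, the sketch below adds a module to the `module_blacklist=` entry of a kernel argument list. It is not the code from this PR: the function name `add_blacklisted_module` and the sample arguments are hypothetical, and it assumes `module_blacklist=` takes a comma-separated list of module names, so existing entries are merged into one rather than duplicated.

```python
from typing import List, Optional

PREFIX = "module_blacklist="


def add_blacklisted_module(kernel_args: List[str], module: str) -> List[str]:
    """Return a copy of kernel_args with `module` present in module_blacklist=.

    Hypothetical helper for illustration only; not part of the operator.
    """
    result: List[str] = []
    modules: List[str] = []
    insert_at: Optional[int] = None

    for arg in kernel_args:
        if arg.startswith(PREFIX):
            # Remember where the first module_blacklist= entry was found.
            if insert_at is None:
                insert_at = len(result)
            # Collect module names, skipping empties and duplicates.
            for name in arg[len(PREFIX):].split(","):
                if name and name not in modules:
                    modules.append(name)
        else:
            result.append(arg)

    if module not in modules:
        modules.append(module)

    merged = PREFIX + ",".join(modules)
    if insert_at is None:
        result.append(merged)
    else:
        result.insert(insert_at, merged)
    return result
```

Calling the helper twice with the same module leaves the list unchanged, which matters if the arguments are rebuilt on every test run.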

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2024

openshift-ci bot commented Dec 10, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rbaturov

/test all

@rbaturov

/test ci/prow/e2e-hypershift-pao ci/prow/e2e-hypershift


openshift-ci bot commented Dec 15, 2024

@rbaturov: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-techpreview
/test e2e-gcp-pao
/test e2e-gcp-pao-updating-profile
/test e2e-gcp-pao-workloadhints
/test e2e-hypershift
/test e2e-hypershift-pao
/test e2e-no-cluster
/test e2e-pao-updating-profile-hypershift
/test e2e-upgrade
/test images
/test lint
/test unit
/test verify
/test vet

The following commands are available to trigger optional jobs:

/test e2e-telco5g-cnftests
/test okd-scos-e2e-aws-ovn
/test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-operator
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-ovn
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-ovn-techpreview
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao-updating-profile
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao-workloadhints
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-hypershift
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-hypershift-pao
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-no-cluster
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade
pull-ci-openshift-cluster-node-tuning-operator-master-images
pull-ci-openshift-cluster-node-tuning-operator-master-lint
pull-ci-openshift-cluster-node-tuning-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-cluster-node-tuning-operator-master-unit
pull-ci-openshift-cluster-node-tuning-operator-master-verify
pull-ci-openshift-cluster-node-tuning-operator-master-vet

In response to this:

/test ci/prow/e2e-hypershift-pao ci/prow/e2e-hypershift

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rbaturov

/test e2e-hypershift e2e-hypershift-pao

@rbaturov rbaturov changed the title e2e: add irdma to module_blacklist kernel args OCPBUGS-46426: e2e: add irdma to module_blacklist kernel args Dec 15, 2024
@rbaturov rbaturov marked this pull request as ready for review December 15, 2024 14:07
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 15, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Dec 15, 2024
@openshift-ci-robot

@rbaturov: This pull request references Jira Issue OCPBUGS-46426, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from swatisehgal and yanirq December 15, 2024 14:08
@rbaturov

/test e2e-hypershift

This implements a workaround to prevent CI failures on specific hardware that uses an Intel E810 network card.
When UserLevelNetworking is set to True, tuned attempts to set the combined channel count equal to the number of reserved CPUs, but fails with the following error:
tuned.utils.commands: Executing 'ethtool -L ens2f0 combined 1' error: netlink error: Device or resource busy
The error occurs because the ice driver rejects the change and reports: ens2f0: Cannot change channels when RDMA is active.
This failure causes the tuned profile to become degraded.
As a temporary solution, adding 'module_blacklist=irdma' to the kernel args prevents the irdma module from loading, which blocks RDMA and avoids these errors.
Reference: OCPBUGS-46426

Signed-off-by: Ronny Baturov <[email protected]>
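
To check whether a node hit the failure described in the commit message, tuned log lines can be scanned for the quoted error. The sketch below is a minimal example under assumptions: the function name `find_busy_channel_errors` is hypothetical, and the pattern is modeled only on the single log line quoted above, so other tuned versions may format the message differently.

```python
import re
from typing import Iterable, List, Tuple

# Modeled on the one log line quoted in this PR; treat the format as an assumption.
PATTERN = re.compile(
    r"Executing 'ethtool -L (?P<iface>\S+) combined (?P<count>\d+)' "
    r"error: .*Device or resource busy"
)


def find_busy_channel_errors(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (interface, requested channel count) for each matching log line."""
    hits: List[Tuple[str, int]] = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((match.group("iface"), int(match.group("count"))))
    return hits
```

An empty result after the kernel argument is applied would suggest the channel change no longer fails, although it does not prove the tuned profile is healthy.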

@jmencak jmencak left a comment

Apart from a minor nit, what is the impact of setting module_blacklist=irdma, and do we really need or want this? At a minimum, I believe we should state the impact in the commit log.

@jmencak

jmencak commented Dec 16, 2024

OK, this is just e2e tests.
/approve
/lgtm
/hold
to give others a chance to comment. Feel free to unhold if you don't get any feedback.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 16, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2024

openshift-ci bot commented Dec 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak, rbaturov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2024
@MarSik

MarSik commented Dec 18, 2024

/lgtm

@ffromani

/lgtm

@rbaturov

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 18, 2024
@rbaturov

/cherry-pick release-4.18 release-4.17 release-4.16

@openshift-cherrypick-robot

@rbaturov: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.18 release-4.17 release-4.16

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d24467f and 2 for PR HEAD a2678de in total

@rbaturov

/retest


openshift-ci bot commented Dec 18, 2024

@rbaturov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | a2678de | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

@openshift-merge-bot openshift-merge-bot bot merged commit c0171d3 into openshift:master Dec 18, 2024
16 of 17 checks passed
@openshift-ci-robot

@rbaturov: Jira Issue OCPBUGS-46426: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-46426 has been moved to the MODIFIED state.


@openshift-cherrypick-robot

@rbaturov: new pull request created: #1257

In response to this:

/cherry-pick release-4.18 release-4.17 release-4.16

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: cluster-node-tuning-operator
This PR has been included in build cluster-node-tuning-operator-container-v4.19.0-202412181837.p0.gc0171d3.assembly.stream.el9.
All builds following this will include this PR.
