Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix:incomplete arp cause busy loop (high cpu util) #7421

Merged
merged 6 commits into from
Apr 20, 2023

Conversation

detailyang
Copy link
Contributor

@detailyang detailyang commented Mar 3, 2023

Description

Hello.
we have identified an issue with incomplete ARP entries that may cause a busy loop for calico-node in production.

We are unsure why every node is generating incomplete ARP entries, as shown in the attached image
image

These entries are causing the calico-node to experience high CPU utilization and become stuck in the syncL2RoutesForLink process. Specifically, the incomplete ARP entry, which holds a nil MAC address, is causing the kernel function rtnl_fdb_del to fail when attempting to remove the FDB entry

https://elixir.bootlin.com/linux/v3.10.108/source/net/core/rtnetlink.c#L2223

As a workaround, we can choose to ignore the incomplete ARP entries for now and continue with normal operations :)

fixes #5460

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Fix high CPU usage in syncL2RoutesForLink: ignore incomplete ARP entries when cleaning up the FDB table. Prevents us from telling the kernel to delete an FDB entry with no HwAddr, which fails triggering a retry loop.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Incomplete ARP entry (nil mac address)can cause the busy loop
@detailyang detailyang requested a review from a team as a code owner March 3, 2023 02:45
@marvin-tigera marvin-tigera added this to the Calico v3.26.0 milestone Mar 3, 2023
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Mar 3, 2023
@CLAassistant
Copy link

CLAassistant commented Mar 3, 2023

CLA assistant check
All committers have signed the CLA.

@detailyang
Copy link
Contributor Author

@mgleung Can you have a look this, many thanks:)

@detailyang
Copy link
Contributor Author

cc @StevenTigera
Can you have a look this, many thanks:)

@detailyang
Copy link
Contributor Author

@fasaxc thanks for reviewing :)

@fasaxc
Copy link
Member

fasaxc commented Mar 16, 2023

/sem-approve

@fasaxc fasaxc added cherry-pick-candidate docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Mar 16, 2023
@marvin-tigera marvin-tigera added docs-pr-required Change is not yet documented and removed docs-pr-required Change is not yet documented labels Mar 16, 2023
@fasaxc
Copy link
Member

fasaxc commented Mar 16, 2023

@detailyang Looks like the linter failed, please can you run make fix in the felix directory and commit that?

@detailyang
Copy link
Contributor Author

detailyang commented Mar 30, 2023

@detailyang Looks like the linter failed, please can you run make fix in the felix directory and commit that?

@fasaxc Can you restart the CI again. It's failed at make check-dirty

See https://tigera.semaphoreci.com/jobs/20627b73-30d4-4312-85da-0010d847a343/plain_logs.txt

Generating manifest from charts/values/./canal-etcd.yaml
walk.go:74: found symbolic link in path: /home/semaphore/calico/charts/calico/crds resolves to /home/semaphore/calico/libcalico-go/config/crd. Contents of linked file included and used
make[2]: Leaving directory '/home/semaphore/calico'
make[1]: Leaving directory '/home/semaphore/calico'
make check-dirty
make[1]: Entering directory '/home/semaphore/calico'
The following files are dirty
 felix/routetable/route_table.go | 4 �[32m++�[m�[31m--�[m
 1 file changed, 2 insertions(+), 2 deletions(-)
make[1]: *** [lib.Makefile:1197: check-dirty] Error 1

@lwr20
Copy link
Member

lwr20 commented Mar 30, 2023

/sem-approve

@fasaxc
Copy link
Member

fasaxc commented Apr 3, 2023

@detailyang did you commit/push the result of make fix; it just formats the code, you still need to commit its changes.

@detailyang
Copy link
Contributor Author

@detailyang did you commit/push the result of make fix; it just formats the code, you still need to commit its changes.

fixed.

@frozenprocess
Copy link
Collaborator

/sem-approve

@@ -1035,6 +1035,10 @@ func (r *RouteTable) syncL2RoutesForLink(ifaceName string) error {
var updatesFailed bool

for _, existing := range existingNeigh {
if existing.HardwareAddr == nil {
log.WithField("entry", entry).Debug("Ignoring existing ARP entry with no hardware addr")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI now failing with entry undefined

@detailyang
Copy link
Contributor Author

@fasaxc
Could you please assist me in checking the CI status? I am unable to locate the CI address. Thank you very much.

@JornShen
Copy link

would you backport lower version ?

@fasaxc
Copy link
Member

fasaxc commented Apr 19, 2023

/sem-approve

@fasaxc fasaxc merged commit eab7244 into projectcalico:master Apr 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-candidate docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

warning log constantly occurs in calico-node pod and kernel
7 participants