Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ubuntu: Bump to latest LTS 24.04 #281

Merged
merged 4 commits into from
Jul 16, 2024
Merged

Conversation

sayboras
Copy link
Member

@sayboras sayboras commented Jun 19, 2024

Testing was done in cilium/cilium#33264

sayboras added a commit to cilium/cilium that referenced this pull request Jun 19, 2024
sayboras added a commit to cilium/cilium that referenced this pull request Jun 19, 2024
sayboras added a commit to cilium/cilium that referenced this pull request Jun 20, 2024
@sayboras sayboras marked this pull request as ready for review June 20, 2024 11:01
@sayboras sayboras requested review from a team as code owners June 20, 2024 11:01
@sayboras sayboras requested review from lmb and borkmann June 20, 2024 11:01
@lmb
Copy link

lmb commented Jun 20, 2024

Do you have output for the test failure?

@sayboras
Copy link
Member Author

Do you have output for the test failure?

Yes, it can be found here https://github.com/cilium/cilium/actions/runs/9589413585

@lmb
Copy link

lmb commented Jun 20, 2024

What is the ubuntu base image used for? It's pretty bad if the verifier test fails with a complexity error, so I'm hesitant to just keep the old version to "fix" it. cc @ti-mo seems like maybe we're using a different clang now? i was under the impression that we bake in our own clang?

sayboras added a commit to cilium/cilium that referenced this pull request Jul 14, 2024
sayboras added a commit to cilium/cilium that referenced this pull request Jul 15, 2024
sayboras added a commit to cilium/cilium that referenced this pull request Jul 15, 2024
sayboras added a commit to cilium/cilium that referenced this pull request Jul 15, 2024
This reverts commit 11cfa1a to avoid
the below verifier issue in cilium/cilium upstream

```
    --- FAIL: TestVerifier/bpf_host (111.35s)
        --- FAIL: TestVerifier/bpf_host/1 (24.05s)
        --- FAIL: TestVerifier/bpf_host/2 (21.32s)
	...
```

Sample error log for TestVerifier/bpf_host/1

```
    verifier_test.go:244: Error: program tail_nodeport_nat_ingress_ipv4: load program: argument list too long: BPF program is too large. Processed 1000001 insn (1445 line(s) omitted)
        Verifier error tail: load program: argument list too long:
        	(1436 line(s) omitted)
        	; struct ipv4_ct_tuple icmp_tuple = {
        	1201: (63) *(u32 *)(r10 -12) = r1
        	; .flags = tuple->flags | TUPLE_F_RELATED,
        	1202: (71) r1 = *(u8 *)(r10 -123)
        	; .flags = tuple->flags | TUPLE_F_RELATED,
        	1203: (44) w1 |= 2
        	; struct ipv4_ct_tuple icmp_tuple = {
        	1204: (73) *(u8 *)(r10 -3) = r1
        	BPF program is too large. Processed 1000001 insn
        	processed 1000001 insns (limit 1000000) max_states_per_insn 17 total_states 55755 peak_states 1015 mark_read 59
```

Sample error log for TestVerifier/bpf_host/2
```
    verifier_test.go:244: Error: program tail_nodeport_nat_ingress_ipv4: load program: argument list too long: BPF program is too large. Processed 1000001 insn (1445 line(s) omitted)
        Verifier error tail: load program: argument list too long:
        	(1436 line(s) omitted)
        	; struct ipv4_ct_tuple icmp_tuple = {
        	1201: (63) *(u32 *)(r10 -12) = r1
        	; .flags = tuple->flags | TUPLE_F_RELATED,
        	1202: (71) r1 = *(u8 *)(r10 -123)
        	; .flags = tuple->flags | TUPLE_F_RELATED,
        	1203: (44) w1 |= 2
        	; struct ipv4_ct_tuple icmp_tuple = {
        	1204: (73) *(u8 *)(r10 -3) = r1
        	BPF program is too large. Processed 1000001 insn
        	processed 1000001 insns (limit 1000000) max_states_per_insn 17 total_states 55755 peak_states 1015 mark_read 59
```
sayboras added a commit to cilium/cilium that referenced this pull request Jul 15, 2024
@sayboras
Copy link
Member Author

What is the ubuntu base image used for? It's pretty bad if the verifier test fails with a complexity error, so I'm hesitant to just keep the old version to "fix" it. cc @ti-mo seems like maybe we're using a different clang now? i was under the impression that we bake in our own clang?

The failure is actually due to #279. I think the best way is to revert llvm version bump, and proceed with ubuntu upgrade.

@dylandreimerink
Copy link
Member

The failure is actually due to #279. I think the best way is to revert llvm version bump, and proceed with ubuntu upgrade.

That sucks, I don't like the idea of blindly revering the compiler version, since quite a lot of time has gone into that upgrade. I would like to involve @gentoo-root first before we do that revert.

@sayboras
Copy link
Member Author

The failure is actually due to #279. I think the best way is to revert llvm version bump, and proceed with ubuntu upgrade.

That sucks, I don't like the idea of blindly revering the compiler version, since quite a lot of time has gone into that upgrade. I would like to involve @gentoo-root first before we do that revert.

Yeah, after I found out about clang version, knowing Max's good work, I would have expected that Max was trying to upgrade in cilium/cilium as per below (which is having same failures as my testing). IMO, it's pretty safe to revert LLVM 1.18 commit, and tackle the verifer issue later.

cilium/cilium#32816

@auriaave
Copy link

auriaave commented Jul 16, 2024

Since a previous LLVM version seems to work, I'm not sure about this report: it suggests that MAX_USED_MAPS and MAX_USED_BTFS limits are now set to 64 (e.g. on kernel 5.15).

link to change in the bpf verifier

Copy link
Member

@dylandreimerink dylandreimerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected that Max was trying to upgrade in cilium/cilium as per below (which is having same failures as my testing). IMO, it's pretty safe to revert LLVM 1.18 commit, and tackle the verifer issue later.

Ah, I was of the believe that we already were on Clang 18 on cilium/cilium, if thats not the case than we can indeed revert to 17 here, for now. Sorry for the noise

@sayboras
Copy link
Member Author

sayboras commented Jul 16, 2024

Since a previous LLVM version seems to work, I'm not sure about this report: it suggests that MAX_USED_MAPS and MAX_USED_BTFS limits are now set to 64 (e.g. on kernel 5.15).

link to change in the bpf verifier

We actually have the error log >1mil statements. Additionally, net-next is actually passed, only 5.10, 5.15 and 6.1 failed.

I did a little bit investigation, it seems to be related to ENABLE_IPV6 flag in cilium bpf code, which triggers some complexity issue.

Full error log can be found in the below GHA.
https://github.com/cilium/cilium/actions/runs/9544509806

@auriaave
Copy link

I looked for similar issues in the tracker and this is what I found:

Interestingly, TestVerifier/bpf_overlay/1 also has ENABLE_IPV6=1 (and the masquerade flag) but it changes with respect to bpf_host/1 in:

  • MAX_HOST_OPTIONS vs MAX_OVERLAY_OPTIONS
  • in the overlay config, LB_SELECTION=1 and LB_SELECTION_MAGLEV=1

@joestringer
Copy link
Member

I created https://github.com/cilium/image-tools/tree/ubuntu-22.04 to start the fork from before we merge this, so future fixes for Cilium v1.16 or older can go into that branch. From v1.17 onwards we can start adopting this.

@joestringer joestringer merged commit f3dfc7f into master Jul 16, 2024
2 checks passed
@joestringer joestringer deleted the pr/tammach/ubuntu-24.04 branch July 16, 2024 23:32
sayboras added a commit to cilium/cilium that referenced this pull request Jul 17, 2024
@sayboras
Copy link
Member Author

I looked for similar issues in the tracker and this is what I found:

* [IPv6 datapath fails to load with Maglev and host firewall due to BPF program size limit cilium#14047](https://github.com/cilium/cilium/issues/14047)

* [[arm64] BGP Control Plane + Native Routing + LB-IPAM results in BPF program too large error (1.15-dev) cilium#28651](https://github.com/cilium/cilium/issues/28651)

Interestingly, TestVerifier/bpf_overlay/1 also has ENABLE_IPV6=1 (and the masquerade flag) but it changes with respect to bpf_host/1 in:

* `MAX_HOST_OPTIONS` vs `MAX_OVERLAY_OPTIONS`

* in the overlay config, `LB_SELECTION=1` and `LB_SELECTION_MAGLEV=1`

Thanks for your input, I would defer this investigation to datapath experts (e.g. @gentoo-root or other datapath team member), as it might take more time for newbie like myself 😅.

PS: Kudos to LVH team with awesome steps, local feedback loop is superb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants