-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NETOBSERV-1995: Use global map, spinlock, split maps #469
Conversation
Skipping CI for Draft Pull Request. |
Perf tests (https://docs.google.com/spreadsheets/d/1qakBaK1dk_rERO30k1cSR4W-Nn0SXW4A3lqQ1sZC4rE/edit?gid=48358889#gid=48358889) show ~ -25% CPU usage with the global map |
} else { | ||
// Key does not exist in the map, and will need to create a new entry. | ||
u64 rtt = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msherif1234 I don't understand the rationale for this code. We're setting default RTT for TCP flows, how could that be accurate? I don't see how this wouldn't produce fake data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this the min value of ythe hook isn't fire IIRC it was a define in the kernel and I believe we doc that as the default min RTT, when I recall more I will update here
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #469 +/- ##
==========================================
+ Coverage 29.65% 29.68% +0.03%
==========================================
Files 50 51 +1
Lines 4863 4915 +52
==========================================
+ Hits 1442 1459 +17
- Misses 3316 3350 +34
- Partials 105 106 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
|
cc @tohojo |
removing packed attribute gives another 20% CPU gain |
@jotak: This pull request references NETOBSERV-1995 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
pkg/model/record_test.go
Outdated
@@ -1,107 +0,0 @@ | |||
package model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @msherif1234 , I'm removing this test because I think it is of quite low value: it only tests ReadFrom
, which is really a direct call to binary.Read
. Meaning all it does is testing that the encoding/binary
standard lib is right. This, at a price of sometimes painful maintenance. Does it worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the reason u removing it because u removed packed so the structure no longer aligns ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the advantage of this make sure the auto generated structure are sane too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get, what sort of issue could it catch, do you have an example?
Anyway, I'm fine reintroducing it. It was not only for the padding change, it's any change in the bpf types that make this test painful to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we go: 119061c
But I'd still appreciate if you can give me a concrete example. It doesn't really test anything about the types.h / go struct mapping, it only test that a go struct is what it claims to be ... that doesn't make really sense to me.
Looked through the BPF code, and it basically LGTM :) |
bpf/flows.c
Outdated
@@ -93,6 +93,7 @@ static inline int flow_monitor(struct __sk_buff *skb, u8 direction) { | |||
// a spinlocked alternative version and use it selectively https://lwn.net/Articles/779120/ | |||
flow_metrics *aggregate_flow = (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id); | |||
if (aggregate_flow != NULL) { | |||
bpf_spin_lock(&aggregate_flow->lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably need to check the return code for spin lock and unlock and log an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, there's no error code if I'm correct. The doc for bpf_spin_lock says
* Returns
* 0
In linux helper.c it returns void: https://elixir.bootlin.com/linux/v6.12.1/source/kernel/bpf/helpers.c#L283
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc said it was returning something https://docs.ebpf.io/linux/helper-function/bpf_spin_lock/ maybe doc is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does, but apparently it's always 0, not sure why : https://github.com/netobserv/netobserv-ebpf-agent/blob/main/bpf/headers/bpf_helper_defs.h#L2359-L2360
bpf/flows.c
Outdated
@@ -80,7 +80,7 @@ static inline int flow_monitor(struct __sk_buff *skb, u8 direction) { | |||
id.direction = direction; | |||
|
|||
// check if this packet need to be filtered if filtering feature is enabled | |||
bool skip = check_and_do_flow_filtering(&id, pkt.flags, 0); | |||
bool skip = check_and_do_flow_filtering(&pkt, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the need to replace id with pkt struct that generated lots of unnecessary changes and I also planned to even delete the pkt info struct at some point ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is due to removing stuff from flow_id, and moving them to flow_metrics, like MAC addresses. As a result, id isn't sufficient to pass it to all these functions in utils
. So instead of adding tons of arguments to the signatures I just use the pkt
that already has everything needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkt
is convenient to not have extra long signatures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u just removed l2 and interface right ? or am I missing something else ? why would flow filter or util core apis need to change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and eth_protocol
and direction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remember datapath has to be efficient adding additional dereferrence isn't efficient IMO that stuff not used and I don't think will impact the apis to change the signature much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reaching the 5 args limit of ebpf calls if I want to remove them from pkt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed via 8e69cfd
that allows also to get rid of a few unnecessary MAC copying
|
||
// even if we know that the entry is new, another CPU might be concurrently inserting a flow | ||
// so we need to specify BPF_ANY | ||
long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_ANY); | ||
long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_NOEXIST); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will probably need to spin lock/unlock here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lock is on the value object, not the key, and it's a new object in that case, so I don't think it makes sense to lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if u have concurrent new flows on different cpus one should create and the other will update ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what the BPF_NOEXIST
flag does: If two CPUs call this concurrently, one of them will fail with EEXIST, and then fall back to the lookup/update path (which does take the lock)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with eexist lookup and update we need to take the lock as well what about BPF_F_LOCK
can I set that flag to takecare of the lock/unlock for me ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BPF_F_LOCK
is for when you want to replace the whole contents of the map value; but that's not what you're doing, you are increasing a couple of counters. So you just re-use exactly the same update code as for when the map lookup succeeds in the first place :)
u16 flags; | ||
u64 latency; | ||
u8 errno; | ||
} dns_record; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we keep dns record here and not move to additional metrics ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I mentioned that in the TODO here #470
I didn't do it at first because it's modified from the TC hook so it isn't required to move it ... but still it would probably be better to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed via 01950d5
bpf/types.h
Outdated
const struct filter_value_t *unused9 __attribute__((unused)); | ||
|
||
// Force emitting enums/structs into the ELF | ||
const enum tcp_flags_t *unused0 __attribute__((unused)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally added that line after each structure definition as reminder so bpf2go can generate eqv go struct for userspace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're just moved at the end of the file. I can rollback that, I don't really care, but I was finding it easier to choose the next "unusedX" number when having all of them in sight at once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not grouping them to gether make it easy to forget but at the cost of looking for the next unused manually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i'll rollback that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed via 01950d5
@@ -106,6 +106,7 @@ prereqs: ## Check if prerequisites are met, and install missing dependencies | |||
fmt: ## Run go fmt against code. | |||
@echo "### Formatting code" | |||
go fmt ./... | |||
find ./bpf -type f -not -path "./bpf/headers/*" -name "*.[ch]" | xargs clang-format -i --Werror |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have this rule as part of the lint target we can make it part of the fmt but in that case we need to remove it from the lint so we won't run it twice when build ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's for two different things: the format rule allows to auto-format as part of the local build process on your machine, whereas the lint rule is still useful on the CI to make sure we didn't commit & push unformatted code.
4c5c1af
to
314c328
Compare
|
@@ -51,13 +51,27 @@ | |||
*/ | |||
#include "network_events_monitoring.h" | |||
|
|||
static inline void update_dns(additional_metrics *extra_metrics, pkt_info *pkt, int dns_errno) { | |||
if (pkt->dns_id != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can check dns_err here and if its 0
then add DNS info else add dns_error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there can be an error and still have DNS data: when the dns request wasn't found, we can't generate latency, but we still provide the other data, and return ENOENT.
So it shouldn't be a if/else here, if we want to still provide what we have
aggregate_flow->dns_record.flags = pkt->dns_flags; | ||
aggregate_flow->dns_record.latency = pkt->dns_latency; | ||
aggregate_flow->dns_record.errno = dns_errno; | ||
bpf_spin_unlock(&aggregate_flow->lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we no longer needs
netobserv-ebpf-agent/bpf/flows.c
Line 59 in 119061c
// it might happen that start_mono_time hasn't been set due to |
as that was specific to perCPU map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, good catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
SrcPort uint16 | ||
DstPort uint16 | ||
TransportProtocol uint8 | ||
IcmpType uint8 | ||
IcmpCode uint8 | ||
_ [3]byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example of what removing packed will cause BTW we could avoid so gaps by moving the fields around if u move direction to the bottom might help this struct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so in this PR I haven't care of these little optimizations because in the next PR I still move stuff around and there I'm caring about these optimizations
case "icmpCodeIPv4", "icmpCodeIPv6": | ||
ieVal.SetUnsigned8Value(record.Id.IcmpCode) | ||
ieVal.SetUnsigned8Value(record.ID.IcmpCode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change from Id
to ID
generated tons of unnecessary changes here in many other places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah but it's the naming convention & makes golint happier ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make linter happier but screw up all inflightes PRs :(
few nits but I am more less done with review, will be good to run features testing with this PR pre merge |
pls update the arch doc reference to perCPu hashmap and make sure to run #172 (comment) to make sure we have no concurrency issues again |
Checking for zero-values made sense only with per-cpu maps
btw added a diagram for kernel space https://github.com/jotak/netobserv-agent/blob/global-map/docs/architecture.md |
can u share the tool used to draw as we add new hooks on a regular basis so that diag will be quickly become out of date |
@msherif1234 you don't necessarily need a tool you can directly edit script:
So you can easily add a new hook by adding something like:
(M2 stands for the additional metrics map) Or if you prefer there's also online tools such as this: https://mermaid.live/ |
I took some approximative measurements on the ebpf side, written down there: https://docs.google.com/spreadsheets/d/1gdwX4ZLjKicaJzANVq-r-lAW6uZxefqjBsIw3Orx4bo/edit?gid=0#gid=0 while true; do bpftool map show | grep -B 3 netobserv | grep memlock | sed -re 's/.*memlock ([0-9]+)B.*/\1/' | awk '{s+=$1} END {print s}' ; sleep 1; done
For CPU I used bpftop, with even more approximative manual measurement (don't rely too much on my CPU measurements!) Between -28% and -54% memory used depending on scenarios. And between -4% and -36% CPU. The next PR improves that a little bit further with between -42% and -61% memory, and around -20% CPU. Note that if I count the RB size, the improvements are much lower because on my nodes & scenarios it takes most of the size, but since it is constant I think it's worth looking without it. |
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=1cd8c04 make set-agent-image |
/lgtm |
/label qe-approved |
@jotak: This pull request references NETOBSERV-1995 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
thanks @msherif1234 @tohojo @Amoghrd ! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jotak The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* NETOBSERV-1987: split maps Split BPF maps to avoid having one huge map for flows. This should move us farther from the "stack size limit" that potentially shows up when adding new features. This is also a pre-req for later being able to move away from per-cpu maps, which is wasting memory by duplicating flows across CPUs Also: - remove MAC and eth_protocol from map key (they're moved to flow metrics ie. map values) - in every hook, we don't try anymore to create the flow it it didn't exist. We assume it should exist. Note that, previously, we were adding bytes/packet counters from hooks, which doesn't seem correct as counting bytes/packets is the role of the TC hook only - else it can lead to overcounts - remove code that sets MIN_RTT on TCP flows (not sure why we had that) - small refactoring of the "Accumulate" functions to adapt to new model Use global map & spinlock regen .o Remove packing attributes Packed attributes make linter unhappy with the spin-lock Also, turns out it improves quite a lot the performances to remove that: my tests show another 20% gain Remove low value test for binary read TestRecordBinaryEncoding only tests ReadFrom which... basically is a simple call to binary.Read. So we basically testing the encoding/binary standard lib. Avoid unnecessary indirections & copies of MAC Move DNS to secondary map, revert types.h reorg Fix DNS issue - retry in case of concurrent write Also add an error counter for DNS map updates * Address review comments (renamings) * Make BpfFlowMetrics a pointer in records * Reinsert binary encoding test * Remove unnecessary checks Checking for zero-values made sense only with per-cpu maps * update architecture doc * kernel space mermaid arch * Update architecture.md
Also:
metrics ie. map values)
adding bytes/packet counters from hooks, which doesn't seem correct as
counting bytes/packets is the role of the TC hook only - else it can
lead to overcounts I guess