
[WIP] new(proposal): drop support for syscall enter events #2068

Open · wants to merge 2 commits into master

Conversation

Andreagit97
Member

@Andreagit97 commented Sep 13, 2024

What type of PR is this?

/kind design

Any specific area of the project related to this PR?

/area proposals

Does this PR require a change in the driver versions?

No

What this PR does / why we need it:

This document proposes removing support for syscall enter events in our codebase. The primary reason behind this proposal is to reduce the throughput of events that our userspace needs to process: as the syscall rate increases, we can no longer keep up and we start dropping events. See the full proposal for more details!

I'm curious to hear some feedback on this; we have talked about it many times, and maybe now is the right time to tackle it!

Some related discussions:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@poiana
Contributor

poiana commented Sep 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.69%. Comparing base (e2c5174) to head (0a14985).
Report is 259 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2068      +/-   ##
==========================================
- Coverage   74.31%   73.69%   -0.62%     
==========================================
  Files         253      253              
  Lines       30966    31914     +948     
  Branches     5403     5608     +205     
==========================================
+ Hits        23011    23520     +509     
- Misses       7942     8363     +421     
- Partials       13       31      +18     
Flag Coverage Δ
libsinsp 73.69% <ø> (-0.62%) ⬇️



github-actions bot commented Sep 13, 2024

Perf diff from master - unit tests

     4.20%     +1.68%  [.] next
     3.89%     +0.88%  [.] sinsp_evt::load_params
     4.49%     -0.84%  [.] sinsp_parser::process_event
    11.01%     -0.67%  [.] sinsp_parser::reset
     0.50%     +0.58%  [.] std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, libsinsp::state::dynamic_struct::field_info>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, libsinsp::state::dynamic_struct::field_info> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find
     0.83%     -0.55%  [.] std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, libsinsp::state::dynamic_struct::field_info>, true> > >::_M_deallocate_nodes
     1.09%     -0.55%  [.] std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Identity, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::find
     6.13%     -0.52%  [.] sinsp::next
     1.60%     +0.50%  [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
     3.45%     -0.48%  [.] sinsp_thread_manager::get_thread_ref

Heap diff from master - unit tests

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Heap diff from master - scap file

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Benchmarks diff from master

Comparing gbench_data.json to /root/actions-runner/_work/libs/libs/build/gbench_data.json
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
BM_sinsp_split_mean                                            -0.0075         -0.0075           153           152           153           152
BM_sinsp_split_median                                          -0.0135         -0.0135           153           151           153           151
BM_sinsp_split_stddev                                          +0.0518         +0.0524             1             1             1             1
BM_sinsp_split_cv                                              +0.0597         +0.0604             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_mean                  -0.0238         -0.0238            63            62            63            62
BM_sinsp_concatenate_paths_relative_path_median                -0.0294         -0.0294            63            62            63            62
BM_sinsp_concatenate_paths_relative_path_stddev                +0.5897         +0.5890             0             1             0             1
BM_sinsp_concatenate_paths_relative_path_cv                    +0.6285         +0.6278             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_mean                     -0.0061         -0.0061            25            25            25            25
BM_sinsp_concatenate_paths_empty_path_median                   -0.0049         -0.0049            25            25            25            25
BM_sinsp_concatenate_paths_empty_path_stddev                   +0.0329         +0.0354             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_cv                       +0.0393         +0.0418             0             0             0             0
BM_sinsp_concatenate_paths_absolute_path_mean                  +0.0420         +0.0420            61            64            61            64
BM_sinsp_concatenate_paths_absolute_path_median                +0.0498         +0.0498            61            64            61            64
BM_sinsp_concatenate_paths_absolute_path_stddev                -0.5577         -0.5576             1             1             1             1
BM_sinsp_concatenate_paths_absolute_path_cv                    -0.5755         -0.5754             0             0             0             0
BM_sinsp_split_container_image_mean                            +0.0072         +0.0072           396           399           396           399
BM_sinsp_split_container_image_median                          +0.0057         +0.0057           396           398           396           398
BM_sinsp_split_container_image_stddev                          -0.1906         -0.1861             3             3             3             3
BM_sinsp_split_container_image_cv                              -0.1964         -0.1919             0             0             0             0

@Andreagit97
Member Author

Any feedback on this? I keep seeing many users with a lot of drops...

@Andreagit97
Member Author

@falcosecurity/libs-maintainers I will try to create a branch ASAP to show you some concrete code on this

@FedeDP
Contributor

FedeDP commented Oct 8, 2024

I would love to join you for this enormous task 🚀

Contributor

@Molter73 left a comment


I'm not 100% sold on the idea. As I mention in another comment, this proposal basically says that we should rewrite all the drivers and parsing logic. Even if that is not quite the case, it will take tons of changes in them, which can cause bugs, regressions and what-have-yous. The proposal states that in some parts we will be sending half the data to userspace, but if all data is moved to exit events, that is not actually true: we will send half as many events, but each will be almost twice as big, meaning we only save the length of the enter event header, maybe a bit more if we find redundant data in some events.

I also don't see the need to move ALL syscalls to exit-only events. I would argue most of the performance impact comes from syscalls like read/open/send/recv and their relatives, since they are the most frequently triggered (the syscall fingerprint even shows these are triggered at least twice as often as other syscalls). Why not try changing only some of those and see if that is good enough?

IDK, it all seems like a huge amount of effort and a big risk, and I don't see how the enormous gains the proposal is being sold on will manifest themselves. I'd be happy to be proven wrong though.

Comment on lines +322 to +327
* Update our three drivers to collect all parameters in the exit fillers and remove the enter ones.
* Update scap-file management to convert all enter events generated over the years into new exit events.
* Update our userspace engines to support only exit events.
* Update the sinsp parsing and extraction logic to use only exit events.
* Update all tests (drivers, libscap, unit, e2e) to use only exit events.
* Update all documentation.
Contributor


This is a lot of work, and I mean A LOT of work. From where I stand, this is basically a rewrite of the parsing logic of sinsp and all 3 drivers. This is a ton of risk and I'm not convinced moving everything to exit events will have the huge impact you are expecting it to have. In my mind, for me to sign off on this, you'd need to have CPU and data savings of way above 50%, otherwise it's too much risk of having regressions.

Member Author


Well, I can fully understand you. This is a lot of work.
The point of this proposal is that today we don't scale on machines with more than 96 CPUs (an example here https://gist.github.com/stevenbrz/69391aa71b22d205b6add88ae10fb905#file-gistfile1-txt-L126-L166) and we need to find a solution. What is proposed here is not the final solution, but I see it at least as a first step toward it. IMO, going from 10 Mevt/s to 5 Mevt/s is already something.

you'd need to have CPU and data savings of way above 50%

Yes, we probably won't reach something like this, but we will process half of the events in userspace, and that already sounds like a good improvement. Unfortunately, obtaining precise data before doing the change would be very complicated...

@FedeDP
Contributor

FedeDP commented Oct 8, 2024

@Molter73
As far as I see it, we will:

  • send and then parse half of the events, which means we can increase the throughput in userspace and have far fewer drops. Even if not doubling, a 30/40% increase alone would be huge and IMHO worth the big effort.
  • by changing the size of some types (e.g. fd and errno to 32 bits) we will save lots of precious bytes for every event sent to userspace
  • reach a more consistent codebase, since we will have fewer parsers and less parser magic (e.g. store_evt()/retrieve_old_evt())
  • no more syscall-enter-related wake-ups for our drivers
  • halve the glue code needed to support new syscalls: no more 2 events in the event_table, 2 fillers, 2 driver tests, 2 parsers, and so on. This is a nice gain for the future of the project.

I know Andrea has more data to back his proposal for sure :)

@Molter73
Contributor

Molter73 commented Oct 8, 2024

  • send and then parse half of the events

Fair, but the amount of data and fields being parsed would still be almost the same, right? What I mean is, you would only skip the enter header, then go on to parse the same number of fields effectively, you'd just be doing that in one go, so the total processing done in userspace would be the same.

  • Even if not doubling, a 30/40% increase alone would be huge and IMHO worth the big effort.

You are very optimistic about the amount of gains we'll get, I don't think it's gonna get even close to that, but again, I'd be happy to be proven wrong.

  • by changing the size of some types (e.g. fd and errno to 32 bits) we will save lots of precious bytes for every event sent to userspace

I don't see what this has to do with moving all fields to the exit side. Why couldn't we do this with the existing model?

  • reach a more consistent codebase, since we will have fewer parsers and less parser magic (e.g. store_evt()/retrieve_old_evt())

  • no more syscall-enter-related wake-ups for our drivers

  • halve the glue code needed to support new syscalls: no more 2 events in the event_table, 2 fillers, 2 driver tests, 2 parsers, and so on. This is a nice gain for the future of the project.

These 3 are fair points, though they still require a massive rewrite that I'm not fully OK with.

If you guys can show me how we would get a 30% performance improvement from just shuffling things around, I'll gladly eat my words. For now, all I see is a massive rewrite that will be a pain to go through and will put us in a position for a bunch of new bugs to pop up.

@Andreagit97
Member Author

Andreagit97 commented Oct 8, 2024

Fair, but the amount of data and fields being parsed would still be almost the same, right? What I mean is, you would only skip the enter header, then go on to parse the same number of fields effectively, you'd just be doing that in one go, so the total processing done in userspace would be the same.

Uhm, more or less. Today, for the most frequent events, we store the whole enter event in memory and reuse it at exit time.
Just to give you a pointer:

case PPME_SYSCALL_OPEN_E:

This is not only a cost (in scenarios where we have 4 million open/s we save 2 million OPEN_E events per second, and the same goes for close, socket, etc.), but it's also a useless complication and a risk... sometimes, if we drop the enter event, we are not able to correctly process the exit one, so we end up doing a sort of approximation without any reason. Why should the information be shared between enter and exit events?
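Just to make the enter/exit coupling concrete, here is a minimal sketch of the store-and-reuse pattern described above; all names (on_open_enter, on_open_exit, s_pending_enters) are invented for illustration and do not correspond to the real sinsp parser code:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-thread storage: the whole enter event is copied aside at enter time.
struct stored_enter_evt {
	std::vector<uint8_t> raw; // full copy of the OPEN_E payload
};

static std::unordered_map<int64_t, stored_enter_evt> s_pending_enters;

// At OPEN_E time: memcpy the whole event into storage keyed by tid.
void on_open_enter(int64_t tid, const uint8_t* evt, size_t len) {
	s_pending_enters[tid].raw.assign(evt, evt + len);
}

// At OPEN_X time: fetch the stored enter event to recover name/flags/mode.
bool on_open_exit(int64_t tid) {
	auto it = s_pending_enters.find(tid);
	if(it == s_pending_enters.end()) {
		return false; // the enter event was dropped: the exit can only be approximated
	}
	// ... merge the stored enter parameters with the exit parameters ...
	s_pending_enters.erase(it);
	return true;
}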

so the total processing done in userspace would be the same.

Moreover, we need to consider that there is not only the parsing phase: after parsing, the event is sent to all the filter checks, and we need to evaluate all of them on events that don't carry a real value...

You are very optimistic about the amount of gains we'll get, I don't think it's gonna get even close to that, but again, I'd be happy to be proven wrong.

Well, that's true, we have no data to discuss this, and unfortunately it's very difficult to obtain it without implementing the solution...

by changing the size of some types (e.g. fd and errno to 32 bits) we will save lots of precious bytes for every event sent to userspace

I don't see what this has to do with moving all fields to the exit side. Why couldn't we do this with the existing model?

This would require generating new event pairs for almost all the events that have at least one of the fd and pid fields. It makes sense to do it as part of a broader refactor like the one we are talking about here.

If you guys can show me how we would get 30% performance improvement from just shuffling things around, I'll gladly eat my words, for now, all I see is a massive rewrite that will be a pain to go through and put us in a position for a bunch of new bugs to pop up.

Well, I see your point, really.
On the driver side, I'm not so worried: we have tests for each syscall and we don't need to change the logic, we just need to move it into a single filler.
The userspace side is certainly more worrying, but having done it for the open syscall I can say that fixing the tests accordingly is more of a burden than updating the parsers... Enter events are just stored in memory and reused at exit time; they don't have a big role in the core parsing logic, so having all the parameters in the exit event just simplifies the flow without adding complexity, or at least this is how I see it. Moreover, we should highlight that many events are not parsed at all in userspace, so for those we need to update nothing.

Doing this work for just one event, I've already spotted some inconsistencies/bugs, so it's true that we will surely introduce some new bugs, but today we already have many of them hidden somewhere. So at the end of the work, I'm not sure what the balance of introduced/fixed bugs would be...

BTW in any case I would proceed with a separate branch, doing PR after PR would be very difficult and dangerous as you correctly highlighted.

Apart from the huge amount of changes, I don't see many other downsides, only upsides:

  • we should achieve better instrumentation time kernel side
  • less CPU time spent in userspace
  • cleaner and more intuitive userspace logic

But yes I see how these changes can be scary...

Signed-off-by: Andrea Terzolo <[email protected]>
Co-authored-by: Mauro Ezequiel Moltrasio <[email protected]>
@Andreagit97
Member Author

Since we talked about byte savings, I would add a bit of information. This is not strictly related to this initiative, but if we decide to go this way we could think about it...

Today's event schema:

/*
 * SCAP EVENT FORMAT:
 *
 * +-----------------------------------------+
 * |  Header   | u16 | u16 | param1 | param2 |
 * +-----------------------------------------+
 * ^           ^           ^
 * |           |           |
 * ppm_hdr     |           |
 * lengths_arr-+           |
 * raw_params--------------+
 */

// 26 bytes
struct ppm_evt_hdr {
	uint64_t ts; /* timestamp, in nanoseconds from epoch */
	uint64_t tid; /* the tid of the thread that generated this event */
	uint32_t len; /* the event len, including the header */
	uint16_t type; /* the event type */
	uint32_t nparams; /* the number of parameters of the event */
};
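As a quick sanity check of the 26-byte figure (this assumes the header is byte-packed, as the comment above implies; it is only illustrative arithmetic):

static_assert(8 /*ts*/ + 8 /*tid*/ + 4 /*len*/ + 2 /*type*/ + 4 /*nparams*/ == 26,
              "ppm_evt_hdr fields sum to 26 bytes when packed");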

Why don't we do something like this?

/* +----------+
 * |  Header  |
 * +----------+
 */

typedef uint16_t evt_hdr;


/* Concrete example */
[PPME_SYSCALL_CLOSE_X] = {"close",
                          EC_IO_OTHER | EC_SYSCALL,
                          EF_DESTROYS_FD | EF_USES_FD | EF_MODIFIES_STATE,
                          2,
                          {{"res", PT_ERRNO, PF_DEC},
                           {"fd", PT_FD, PF_DEC}}},



/* +--------------------------------------------+
 * |  TYPE_CLOSE_X  | len | tid | ts | res | fd |
 * +--------------------------------------------+
 *
 * - Total len: 16 + 16 + 32 + 64 + 32 + 32 bits = 24 bytes (we save 4 extra bytes because we don't need the len of `res` and `fd`). Note that this event fires roughly `2/3 Million/s`.
 */

/* +----------------------------------------------------------------------------------------------+
 * |  TYPE_OPEN_X  | len | tid | ts | res | fd | flags | mode | dev | ino | len_name (u16) | name |
 * +----------------------------------------------------------------------------------------------+
 *
 * - Total len: 16 + 16 + 32 + 64 + 32 + 32 + 32 + 32 + 32 + 64 + 16 + strlen(name) -> we save 12 extra bytes (only `name` needs its own len)
 *
 */

/* +----------------------------------------------------------------------------------------------+
 * |  TYPE_MMAP_X  | len | tid | ts | res | ... |
 * +----------------------------------------------------------------------------------------------+
 *
 * - This design would allow us to handle special cases: for example, mmap has a return value on 64 bits, and we can easily handle it because sinsp will know that when it parses a `TYPE_MMAP_X` event
 *
 */
  • The header is only the type on 16 bits
  • Every type can have its own schema.
  • We don't have the len array anymore: we just push the total len, plus a len before each variable-size parameter. For fixed-size parameters, the len is already encoded in the parameter type in the event table (a minimal decode sketch follows).
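A minimal decode sketch of this idea, using the OPEN_X layout drawn above; the field names, widths and ordering come from the diagrams in this comment and are purely illustrative, not real libs code:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative decoded view of the proposed TYPE_OPEN_X payload (after type/len/tid/ts).
struct decoded_open_x {
	int32_t res;        // return value, shrunk to 32 bits in this proposal
	int32_t fd;         // fd, shrunk to 32 bits in this proposal
	uint32_t flags;
	uint32_t mode;
	uint32_t dev;
	uint64_t ino;
	uint16_t name_len;  // only variable-size parameters carry an explicit length
	const char* name;
};

void decode_open_x(const uint8_t* p, decoded_open_x* out) {
	size_t off = 0;
	// Fixed-size parameters: their size is known from the event table, no per-param length needed.
	std::memcpy(&out->res, p + off, 4);      off += 4;
	std::memcpy(&out->fd, p + off, 4);       off += 4;
	std::memcpy(&out->flags, p + off, 4);    off += 4;
	std::memcpy(&out->mode, p + off, 4);     off += 4;
	std::memcpy(&out->dev, p + off, 4);      off += 4;
	std::memcpy(&out->ino, p + off, 8);      off += 8;
	// Variable-size parameter: a u16 length precedes the data.
	std::memcpy(&out->name_len, p + off, 2); off += 2;
	out->name = reinterpret_cast<const char*>(p + off);
}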

Just to give a complete example of a possible saving, consider the PPME_SYSCALL_OPENAT_2_X event:

 [PPME_SYSCALL_OPENAT_2_X] = {"openat",
                               EC_FILE | EC_SYSCALL,
                               EF_CREATES_FD | EF_MODIFIES_STATE,
                               7,
                               {{"fd", PT_FD, PF_DEC},
                                {"dirfd", PT_FD, PF_DEC},
                                {"name", PT_FSRELPATH, PF_NA, DIRFD_PARAM(1)},
                                {"flags", PT_FLAGS32, PF_HEX, file_flags},
                                {"mode", PT_UINT32, PF_OCT},
                                {"dev", PT_UINT32, PF_HEX},
                                {"ino", PT_UINT64, PF_DEC}}},

Today: 26 (header) + 14 (len array) + 8 (fd on 8 bytes) + 8 (dirfd on 8 bytes) + X (for the name) + 4 + 4 + 4 + 8 = 76 bytes + X

With the proposed approach: 16 (type+len+tid+ts) + 4 (fd on 4 bytes) + 4 (dirfd on 4 bytes) + 4 + 4 + 4 + 8 + 2 (only the name field needs the len) + X = 46 bytes + X

We save 30 bytes!
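For the fixed-size case, the proposed close layout above can be pictured roughly as a packed struct; the names and the 32-bit tid come from the diagrams in this comment, and this is only a sketch, not an actual definition:

#include <cstdint>

#pragma pack(push, 1)
struct proposed_close_x_evt {
	uint16_t type; // TYPE_CLOSE_X
	uint16_t len;  // total event length in bytes
	uint32_t tid;  // thread id on 32 bits in this sketch
	uint64_t ts;   // timestamp, ns from epoch
	int32_t res;   // return value/errno on 32 bits
	int32_t fd;    // fd on 32 bits
};
#pragma pack(pop)

static_assert(sizeof(proposed_close_x_evt) == 24,
              "24 bytes per close exit event, vs 26 bytes of header alone today");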

@gnosek
Contributor

gnosek commented Oct 9, 2024

Taking a step back: okay, we can't handle $MILLIONS evt/sec right now, but why exactly? Is the bottleneck in scap_next (and dropping the enter events will help a lot) or is it in the sinsp parsers (and dropping half the events will help only a little bit, as we save on dispatch but the data fields we process are basically the same)?

If you happen to have a 96 core machine to play with, it might be interesting to see how fast we can get with just scap (e.g. scap-open with any sort of printing disabled) vs sinsp (sinsp-example, again, with printing disabled to keep the playing field level).

My expectation is that raw scap will scale to an absurd extent (no reason it wouldn't be faster than memcpy, it's basically an atomic fetch and add per event batch) and it's sinsp that's slow, but I'm ready to be surprised.

(if you do run this test, please gather perf profiles and make some pretty flame graphs :))

Note that we have a mostly random sleep inside (kmod's) scap_next, so on less than full load the perf won't be ideal.
BTW, did we ever consider a futex (equivalent) instead of the sleep?

@Andreagit97 I like the proposed new layout a lot, though I'm mildly worried about having a variable-length res field. I'd sleep minimally better if it was just native register size (8 bytes these days) unconditionally.

@Molter73
Contributor

Molter73 commented Oct 9, 2024

Uhm more or less. Today for the most frequent events we store the whole event in memory and we reuse it in the exit. just to give you a pointer

case PPME_SYSCALL_OPEN_E:

Yes, I know about the storing logic; that's not ideal, but it's basically a memcpy, and I'm sure playing around with allocators could give better performance (arena allocators, something like tcmalloc, etc.). Again, this doesn't need a large refactoring. I do, on the other hand, strongly agree that we shouldn't drop entire events when we drop the enter side; this is a much stronger selling point for the proposal than performance, IMO.

Moreover, we need to consider that there is not only the parsing phase, after the parsing the event is sent to all the filter checks and we need to evaluate all the filter checks on events that don't carry a real value...

I'm not super familiar with filterchecks, but I would argue this is a mistake in the filtercheck logic, which should ignore events that are not useful, not in the driver and parsing logic.

Doing this work just for one event I've already spotted some inconsistencies/bugs, so it's true that we will introduce for sure some new bugs but today we already have many of them hidden somewhere. So at the end of the work, I'm not sure what would be the balance of introduced/fixed bugs...

I recently read an interesting article on how new code is way more likely to have bugs and vulnerabilities than old code, so if you've already spotted bugs on one syscall, I think we can assume we are gonna come out the other end with a negative balance. I'm not trying to completely push back here, but we do need to be aware of the risk of doing such a large refactoring.

BTW in any case I would proceed with a separate branch, doing PR after PR would be very difficult and dangerous as you correctly highlighted.

Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.

  • every type can have its own schema.

How do you handle driver <-> scap compatibility like this? If the schema changes in any way, you will no longer be able to parse an event from an older driver, right? If a field changes from 32 to 64 bits, how would scap know to only read 32 bits when the older driver is being used?

I agree with everything in @gnosek's comment BTW, running the tests he proposes would be very interesting.

@Andreagit97
Member Author

Taking a step back: okay, we can't handle $MILLIONS evt/sec right now, but why exactly? Is the bottleneck in scap_next (and dropping the enter events will help a lot) or is it in the sinsp parsers (and dropping half the events will help only a little bit, as we save on dispatch but the data fields we process are basically the same)?

If you happen to have a 96 core machine to play with, it might be interesting to see how fast we can get with just scap (e.g. scap-open with any sort of printing disabled) vs sinsp (sinsp-example, again, with printing disabled to keep the playing field level).

My expectation is that raw scap will scale to an absurd extent (no reason it wouldn't be faster than memcpy, it's basically an atomic fetch and add per event batch) and it's sinsp that's slow, but I'm ready to be surprised.

I took some data on 80 CPUs in the past for a completely different reason (ringbuffer scraping), but I think it still reports something relevant: https://docs.google.com/document/d/1CLIP4cdAGa6MQnMffQKeVT7Z18rzAyX6XkLZ-HFns1E/edit

  1. It shows the difference between parsing an openat and a fstat in sinsp on 16 CPUs. Storing failed openat events in memory probably has a non-negligible cost when we do it 3 million times per second.
  2. It shows that the bottleneck is in sinsp because with scap we can handle huge throughputs (~ 14 Mevt/s on 80 CPUs with a sequential buffer approach)
  3. It shows that on 80 CPUs we hit some limits when we try to handle more than 2.4 Mevt/s in sinsp. Please note that we are talking about 1.2 Msyscall/s because we have enter+exit. As stated in the document, we say that we support throughputs of 2.4 Mevt/s, but this depends on:
    • How the traffic is distributed among the available CPUs
    • How the syscalls are distributed in time (e.g. 5 syscalls + 40 usec of sleep)

TL;DR: the solution proposed here is not definitive, but it should allow us to gain some time while we come up with a definitive solution (multi-threading, kernel filtering, whatever...). Moreover, apart from halving the number of events we send to userspace, this solution brings some important optimizations:

  • it allows us to reduce kernel-side instrumentation time (something we've always tried to do - see the document here eBPF vs. KMod Performance #267).
  • It allows us to avoid storing events in memory, as in the case of openat.
  • If we apply optimizations to the event schema, it allows us to save a considerable number of bytes per event.
  • It also allows us to simplify the parsing flow a bit, reducing inconsistencies and complex logic that we currently have.

To be honest, aside from the major refactor, I see only possible improvements...

@Andreagit97
Member Author

@Andreagit97 I like the proposed new layout a lot, though I'm mildly worried about having a variable-length res field. I'd sleep minimally better if it was just native register size (8 bytes these days) unconditionally.

That makes sense, but instead of using a common PT_ERRNO type for all the return values, we could assign the precise type of the return value directly in the event table: for open/openat we would have PT_FD, for mmap PT_POINTER, and so on. In this way, the event table carries the information about the return value size.

        [PPME_SYSCALL_OPEN_X] = {"open", ..., {{"res", PT_FD, PF_DEC},
        [PPME_SYSCALL_MMAP_X] = {"mmap", ..., {{"res", PT_POINTER, PF_HEX},

And so on, instead of a generic PT_ERRNO
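A sketch of how the parser could derive the return value size from the typed entry in the event table; PT_FD, PT_ERRNO and PT_POINTER are existing libs parameter types, while the size mapping reflects the 32-bit shrinking discussed in this thread and is only illustrative (the include path is an assumption):

#include <cstdint>
#include "ppm_events_public.h" // assumed location of enum ppm_param_type and the PT_* values

// Hypothetical helper: the on-wire size of `res` follows its type in the event table.
uint32_t res_size_for_type(ppm_param_type t) {
	switch(t) {
	case PT_FD:      return 4; // proposal: fds shrink to 32 bits
	case PT_ERRNO:   return 4; // proposal: errno values shrink to 32 bits
	case PT_POINTER: return 8; // e.g. the mmap return value stays on 64 bits
	default:         return 8; // conservative fallback
	}
}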

@gnosek
Contributor

gnosek commented Oct 9, 2024

What I'm concerned about is having to look up the event schema before we even know if the event is successful or not. Maybe it's not a real issue, idk 🤷

@Andreagit97
Member Author

How do you handle driver <-> scap compatibility like this? If the schema changes in any way, you will no longer be able to parse an event from an older driver, right? If a field changes from 32 to 64 bits, how would scap know to only read 32 bits when the older driver is being used?

That's a good point.
If we decide to change the schema, we will certainly bump the schema version major, so new libs versions won't be able to use old driver versions. This is for live captures. If we talk about replaying a capture (scap-files), we need a converter, but this should be quite easy to implement since we have versions for the event blocks.

@Andreagit97
Member Author

Andreagit97 commented Oct 9, 2024

What I'm concerned about is having to look up the event schema before we even know if the event is successful or not. Maybe it's not a real issue, idk 🤷

If we see that this is an issue, I agree with you that we should fall back to something standard, like always 64 bits.

@gnosek
Contributor

gnosek commented Oct 9, 2024

Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.

Alternatively, we can have a global (per sinsp) flag to drop the enter events and introduce support for it gradually (merging each step to master, with passing tests for both scenarios). This won't require a long-lived side branch that will be a pain to maintain before it's merged (ask me how I know). In the end, we can remove the flag (permanently dropping enter events) or leave it as is if we want.

@leogr
Member

leogr commented Oct 9, 2024

Since the discussion is becoming interesting, I'm joining in with my two cents 😸

I've been talking with Andrea about this proposal for a while, and even though I tend to be conservative, I think there's an opportunity for a substantial improvement here.

I guess the core point here is the issue of drops. We have been evaluating many solutions for years, and honestly, anything we have approached has just mitigated the situation. Still, we have never seen a real improvement. The root cause is likely the legacy design, which we have never tried to change radically.

This proposal favors structural changes that may result in a substantial improvement. So, I'm not against it unless the numbers show something different.

Regarding bugs, I'm not so worried since we are already in a situation where it's tough to discover and fix them (primarily due to the legacy design). We have to deal with bugs anyway, so this should not be a blocker. Somewhat the new solution should reduce the moving parts a bit, which might even result in easier maintenance and fewer bugs in the long term.

That said, I would like to reason about this further once we have a PoC. Now that we don't plan to introduce significant new features, I would definitely avoid stagnation.

@leogr
Member

leogr commented Oct 9, 2024

Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.

👍

@Molter73
Contributor

Molter73 commented Oct 9, 2024

Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.

Alternatively, we can have a global (per sinsp) flag to drop the enter events and introduce support for it gradually (merging each step to master, with passing tests for both scenarios). This won't require a long-lived side branch that will be a pain to maintain before it's merged (ask me how I know). In the end, we can remove the flag (permanently dropping enter events) or leave it as is if we want.

That could work too, I just don't want a 50k+ line PR to suddenly show up in the repo 🙂

@Andreagit97
Member Author

Andreagit97 commented Oct 30, 2024

In the last few weeks, I did some research for this workstream. Here is the branch (https://github.com/Andreagit97/libs/pull/new/remove_sys_enter) I used to get an idea of the possible changes required, kernel side + userspace side. This branch will never see the light of day, but it was useful for evaluating possible approaches.

The best idea I can come up with is to proceed with incremental PRs against libs master:

  1. incrementally add new events in the event table (with all parameters in the exit) and add parsers/filterchecks for them. The idea is to create a sort of "parallel" userspace flow that shouldn't impact today's one. So essentially what you suggested here [WIP] new(proposal): drop support for syscall enter events #2068 (comment). Of course, we will need unit tests for these new events; I think we could use the already existing sinsp_with_test_input framework.
  2. while we add the new events, we can add the corresponding converters for scap-files. The idea would be to add a small converter library inside the scap-file engine with declarative conversion logic, something similar to this:
static std::unordered_map<ppm_event_code, conversion_info> g_conversion_table = {
        {PPME_SYSCALL_OPEN_E, {.instr = {{C_ACTION_SKIP}}}},
        {PPME_SYSCALL_OPEN_X,
         {.desired_type = PPME_SYSCALL_OPEN,
          .valid_param_nums = {4, 6},
          .instr = {{C_FROM_OLD_EVENT | C_MOD_TO_32, 0},
                    {C_FROM_OLD_EVENT, 1},
                    {C_FROM_OLD_EVENT, 2},
                    {C_FROM_OLD_EVENT, 3},
                    {C_FROM_OLD_EVENT, 4},
                    {C_FROM_OLD_EVENT, 5}}}},
        // ...
};

Of course, each converter should be paired with its own unit tests. You can see some examples in the above branch.
The creation of the converter is not strictly required in this phase, but it would probably be easier to do it while we add the new events to the event table.
3. When the userspace is ready to support all the new events, we could try to switch a first kernel driver to the new logic. Please note that until this point all the drivers still work with the same logic we use today, so they still send enter + exit events; the parallel userspace logic with just the exit events is there but should never be called in production. The best candidate for the switch is probably the modern eBPF driver (mainly because it's the one I'm most confident with).
4. If everything works well, we could gradually switch all the drivers to the new logic.

Of course, if we realize that this approach is too complex or too unstable, we can switch to a feature branch and work on it with PRs, leaving master in a stable state while we continue the work in the branch. Please let me know what you think about this approach and if you have alternative ideas/suggestions!

Side note 1: I'm happy we already spotted some bugs/inconsistencies by looking a little deeper inside filterchecks/parsers, plus there was also room for some small cleanups! See:

Side note 2: The schema rework we talked about here (#2068 (comment)) is not part of this initiative. It could be a follow-up, but it's not strictly related.

@Molter73
Contributor

Agree with the proposed method of incrementally adding new events to the master branch, but I'm not 100% sure on the "all or nothing" approach to switching handling. Think it'd be worth it to make it possible to change the implementation being used by using some sort of configuration flag (for those syscalls that have it, obviously), since that would allow progressive testing as new implementations are added.

Maybe the implementation can happen on a separate event table and we can choose between the 2? Or, if kept ordered at the end of the existing table, we could just use an offset? I don't want to block on this, but I would like to see some thought put into it.

@leogr
Member

leogr commented Oct 31, 2024

but I'm not 100% sure on the "all or nothing" approach to switching handling.

Do you mean when we will need to switch drivers to the new approach?

Think it'd be worth it to make it possible to change the implementation being used by using some sort of configuration flag (for those syscalls that have it, obviously), since that would allow progressive testing as new implementations are added.

It would make sense, but if I understood correctly, having both implementations in parallel on the kernel side wouldn't be manageable (not sure, I'm just guessing). In the end, a feature branch is a good fallback solution if we reach a point where things become too cumbersome to manage.

That said, I'm okay with both approaches. I guess we are still at too early a stage, and we will likely find out which fits better later.

@Andreagit97
Member Author

Andreagit97 commented Nov 7, 2024

but I'm not 100% sure on the "all or nothing" approach to switching handling. Think it'd be worth it to make it possible to change the implementation being used by using some sort of configuration flag (for those syscalls that have it, obviously), since that would allow progressive testing as new implementations are added.

Once we have sinsp working with both flavors maybe we could rework the drivers in a feature branch so that every consumer can test the new logic from the feature branch without touching the drivers in libs. In this way, we can do the final switch only when we are really sure that everything is working. WDYT?

We could also do everything in the feature branch, including the sinsp changes, but in my opinion, if possible, it would be better to use master to:

  • avoid conflicts as much as possible
  • reach an agreement as fast as possible
  • have a final feature branch with "only" the driver changes and tests rework, so smaller and easier to integrate.

If we see that this becomes a nightmare of instability, we should probably go with a feature branch only. What I don't love about partial changes in the drivers is:

  • We need to handle situations in which some events are enter+exit (so old logic) and others are exit only (new logic)
  • all consumers of libs (falco, plugins, rules, ...) would have to speak both "languages", i.e. be aware of both types of events, and would have to be updated during the rework, causing possible instabilities.

@Andreagit97
Member Author

Andreagit97 commented Nov 22, 2024

After another iteration, I came up with something that is probably even more conservative.

The idea is to work only in libs/master without forks, using incremental steps. Each PR will contain the following:

  1. Add the enter parameters to the exit event without touching anything else. This would be a minor change in the event schema since we are just adding new parameters to the exit event. This must be done for all drivers (kmod, ebpf, modern ebpf, gvisor). To give you an example:
/////////////////////////// FROM
        [PPME_SYSCALL_READ_E] = {"read",
                                 EC_IO_READ | EC_SYSCALL,
                                 EF_USES_FD | EF_READS_FROM_FD,
                                 2,
                                 {{"fd", PT_FD, PF_DEC}, {"size", PT_UINT32, PF_DEC}}},
        [PPME_SYSCALL_READ_X] = {"read",
                                 EC_IO_READ | EC_SYSCALL,
                                 EF_USES_FD | EF_READS_FROM_FD,
                                 2,
                                 {{"res", PT_ERRNO, PF_DEC},
                                  {"data", PT_BYTEBUF, PF_NA}}},

/////////////////////////// TO
        [PPME_SYSCALL_READ_E] = {"read",
                                 EC_IO_READ | EC_SYSCALL,
                                 EF_USES_FD | EF_READS_FROM_FD,
                                 2,
                                 {{"fd", PT_FD, PF_DEC}, {"size", PT_UINT32, PF_DEC}}},
        [PPME_SYSCALL_READ_X] = {"read",
                                 EC_IO_READ | EC_SYSCALL,
                                 EF_USES_FD | EF_READS_FROM_FD,
                                 4,
                                 {{"res", PT_ERRNO, PF_DEC},
                                  {"data", PT_BYTEBUF, PF_NA},
                                  {"fd", PT_FD, PF_DEC},
                                  {"size", PT_UINT32, PF_DEC}}},
  2. Adapt the sinsp state to work just with exit events. Enter events can still be generated, but they won't populate the state anymore. The idea is to use a feature flag in sinsp to enable/disable enter events (a minimal sketch of such a flag follows after this list). Please note: this means that old driver versions won't work with new libs versions, because they cannot correctly populate the sinsp state once it works only with exit events. The idea is to also bump the minimum driver schema version required by scap:
    #define SCAP_MINIMUM_DRIVER_SCHEMA_VERSION PPM_API_VERSION(2, 0, 0)
  3. Create a scap-file conversion (in a dedicated scap-file converter) to convert ENTER events into merged EXIT ones. While we are at it, we will also handle old event versions in this converter, but each PR will add the conversion for just one event pair.
  4. Add some tests replaying scap-files and checking that everything is good.
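A minimal sketch (hypothetical names, not the real sinsp API) of the feature flag mentioned in step 2: drivers keep emitting enter events, and sinsp simply stops using them for state bookkeeping when the flag is off.

// Illustrative only: an inspector-level switch for the enter-event handling.
class inspector_settings {
public:
	void set_syscall_enter_events_enabled(bool enabled) { m_enter_events_enabled = enabled; }
	bool syscall_enter_events_enabled() const { return m_enter_events_enabled; }

private:
	bool m_enter_events_enabled = true; // default: today's behavior (enter + exit)
};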

At the end of the work, we should have libs master that works in 2 modes:

  • only exit events
  • exit + enter events (like today). Nothing changes for consumers, they still see the same enter events; the only change is that libs internally won't use them anymore.

I hope this could be a good trade-off between our previous ideas.

Side note: this workstream won't include any event-schema refactor, so no new event types and no new event header. I'm not a fan of this, but I can see how such a refactor could slow down the whole initiative.

Note: using exit + enter will have a slightly higher cost than today because the exit event will send more parameters from the kernel side.

@Andreagit97
Member Author

Andreagit97 commented Nov 29, 2024

This work could cause some redefinitions of fillers (you will see this in the next PR); that's the reason why I raised this: #1283 (comment).
The idea is:

the bump of SCAP_MINIMUM_DRIVER_SCHEMA_VERSION in the first point is required because new scap versions won't be able to work with old driver versions. The userspace won't be able to handle enter events so the driver should send the new "full" exit event.
On the other hand, new drivers should work with old scap versions, because they will still send the enter events and old scap versions have the code to handle them.

@leogr
Member

leogr commented Dec 10, 2024

@Andreagit97

As I wrote in the comment on the other issue, I'm against this (unless there's some aspect I'm not fully understanding).
Why would "adding a thing" create a backward-incompatible change? 🤔
It looks like a workaround for a design issue.
