Enable Packed Syscalls #3077
base: master
Conversation
I like this idea! It would be good to add documentation about it and obviously a RISC-V implementation as well :)
I also like this approach, but thinking about how the call site will look, it seems like there is overhead in preparing the struct. Yes, overall it will be faster, but I'm not so sure about the code size impact. We were exploring an alternative approach: a hybrid approach is possible if a more complex, but faster-to-fill, struct of packed syscalls is used.
This way the number of instructions needed to prepare a call is minimized.
FWIW, this is pretty much exactly what I have envisioned on the call. IMHO a generic mechanism like this is preferable over a more specific one related to buffers such as what @vsukhoml describes, given that we don't generally enforce any particular buffer-sharing semantics w.r.t. a specific capsule call. Hence, the ability to batch-supply system calls seems like a good, easy-to-use facility to get around any performance overheads resulting from frequent context switches. Moreover, I agree with @alistair23 that this should have a corresponding RISC-V implementation. Also, we should supply return values of issued system calls back to userspace (from my quick glimpse over the current PR state, that does not seem to happen yet?). On that note, we might want to also incorporate a limited error handling mechanism, e.g. interrupting the batch processing as soon as a single error is encountered. When we encode return values and whether a specific call has been processed, userspace can determine the exact nature of the error encountered.
Yes, this was exactly your idea. I agree with @lschuermann here. This is why I do not see any way to support this in a transactional manner, meaning that failed allows and subscribes would be reverted.
Return values are already being sent to the application by replacing the system call arguments in the system call frame. The packed system call receives a third argument representing the execution error policy:
The batch returns either:
tock/arch/cortex-m/src/syscall.rs Lines 43 to 54 in d8f4927
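As a rough sketch of this write-back mechanism (the `PackedFrame` type and `execute_syscall` function below are hypothetical stand-ins, not the PR's actual code):

```rust
// Hypothetical sketch: the kernel overwrites each packed call's argument
// slots with its return values, so the application can read the results
// out of the same frames after the batch completes.

#[repr(C)]
struct PackedFrame {
    syscall_num: u32,
    args: [u32; 4], // replaced in place with the return values
}

fn execute_syscall(num: u32, args: &[u32; 4]) -> [u32; 4] {
    // Dispatch to the real syscall handler; stubbed out for the sketch.
    let _ = (num, args);
    [0; 4]
}

fn run_batch(frames: &mut [PackedFrame]) {
    for frame in frames.iter_mut() {
        // Return values replace the arguments in the syscall frame.
        frame.args = execute_syscall(frame.syscall_num, &frame.args);
    }
}
```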
Ah, sorry, I missed that! This sounds pretty good already.
Sorry, that was unintended. These buttons are way too close to each other. 😅
I added the RISC-V port. @alistair23, I would love some feedback.
@alexandruradovici do you have any userspace code that targets this system call, even rough code? If so, I would love to take a look at how that looks. This currently seems to add 800-900 bytes of size to the kernel, so I am curious how much use of these packed system calls is needed to make up for that increase. I understand the goal of packed system calls was both performance and code size, but I am curious to see how much it actually helps with size in practice.
FWIW, #3080 could be of interest here. I was reminded of #2582 while reviewing this and noticing your use of
@alexandruradovici I had a quick look and it looks good. Like Hudson said, if there is a userspace implementation I can test and have a closer look at, that would be great!
I pushed an example: tock/libtock-c#284
I tried to improve the code size for RISC-V. I deleted the
Interestingly enough, there seems to be no code size difference for ARMv7 (microbit). I kind of find that hard to believe.
I'm not sure that this is the point, but using normal reads and writes instead of
For this change to be useful, it should also result in code size savings. The benefits of avoiding many syscalls lie only in avoiding context saving/restoring, which is relatively cheap, and in potential scheduling benefits (though we need to check whether a long sequence of syscalls will 'deny' service to other apps for a prolonged time), which are less valuable.
I see your point @vsukhoml, but wouldn't a complex command system call (allow buffer list + subscribe + command) also require filling out a longer struct (pointer + size × n)? I'll do some performance tests and get back to you.
I did some performance tests. I ran sets of 25, 50 and 100 command system calls to the gpio driver and led driver. The test bench works in the following way:
I measured the time the gpio is high using an oscilloscope connected to the gpio pin. The test bench source code is in the packed application. The main ideas seem to be the following:
I tried combining the system call number and the driver number into one single

[Measurement tables: microbit v2, esp32c3-devkitm-1]
Thank you for these measurements @alexandruradovici! I do believe that these numbers are somewhat conservative though. @vsukhoml's suggestion was specifically to pack different system calls into a single invocation, which might increase our code-size savings, assuming the code for marshalling and unmarshalling can be made sufficiently efficient (possibly assembly?). That being said, I think the processing overhead already seems reasonably good. As discussed with you today, I can try to get some accurate cycle numbers on RISC-V and test a case combining allow, subscribe, command and unallows in a single invocation through the
@alexandruradovici thanks for the measurements! It is good if the code bloat remains small. There are a few things which make me think code size at the call site can grow: filling a memory struct requires more instructions than just shuffling values between registers, you also need to write additional constants, the compiler may have to spill registers into memory to free registers, and the code size impact depends on the ISA and on register use in the surrounding code. And although it can be a chicken-and-egg problem, doesn't this struct require its own allow? Semantically, this packed syscall is
@lschuermann yes, the RISC-V data is of particular interest.
An allow is not necessary, as the memory can be used directly by the low-level part of the kernel and never reaches a capsule. This should be safe, as the process does not get control back until the packed system call is fully executed.
I think this can be done on a case-by-case basis, but it is not a generic solution. Given the small speed improvement (1us - 2us per system call), the packed system call makes sense for longer packs of system calls.
I'm not sure I understand your point here.
I tried packing the system call number and command number into a single 32-bit word, but code size increased by 8 bytes.
I made another test application that uses all the system calls. The driver used receives allows, a subscribe, and a command. It seems that using packed system calls saves bytes. This is more or less what @lschuermann and I expected when we discussed this. EDIT: the size decrease is most probably due to the fact that the
When using unpacked system calls (by using
I'd like to suggest that the compound call should terminate early if a syscall fails; the return result can indicate which one it failed on (the number of succeeding calls). vNFS https://www.usenix.org/conference/fast17/technical-sessions/presentation/chen is a similar approach used in NFS, and I think there are a lot of similarities we can learn from.
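A minimal sketch of that early-termination policy, assuming a hypothetical `dispatch` stand-in for the kernel's syscall handler:

```rust
// Stop at the first failing call and report how many calls succeeded.

fn dispatch(_call: u32) -> bool {
    true // stub: returns whether the call succeeded
}

/// Returns Ok(()) if every call succeeded, or Err(n) where n is the
/// number of calls that completed successfully before the first failure.
fn run_compound(calls: &[u32]) -> Result<(), usize> {
    for (i, &call) in calls.iter().enumerate() {
        if !dispatch(call) {
            return Err(i);
        }
    }
    Ok(())
}
```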
I see convincing arguments both for this strategy and for a strategy that follows through with processing the system call batch in spite of an error. While stopping after the first error and reporting the individual calls' success or failure values gives userspace flexibility in handling error cases however it likes, having the option to execute subsequent calls after an error allows us to build thin system call wrappers in userspace which are much less complex for the common case. Take the example of an application printing something to the console: the app will perform the following operations:
Another possible reason to have a 'structured' input (list of buffers to allow & events to subscribe + command) rather than a list of syscalls: it is easier to decide what to do to properly finalize the result of the compound call (unallows/unsubscribes still need to be called).
Given the standard subscribe-allow-command-yield flow, I think it is appealing to extend the Tock syscall model to avoid essentially redundant context switches.

- It's cool that this can be done without modifying the core kernel. However, the architecture itself is not what is enabling this optimization; the design of Tock is. As such, I think this should be in the core kernel and not left as a per-arch implementation-specific feature. Writing cross-platform apps gets more difficult if architectures can choose to include this feature or not. Following on that, we already have a mechanism to share memory between an app and the kernel, so why is the series of syscalls not stored in an `allow`ed buffer?
- The subtleties around `yield()` are somewhat concerning. The concurrency model for apps is already tricky and confusing for new Tock users, and I'm worried that adding a new caveat for how yield works and where it can be used only adds to the complexity. What about a limitation that `yield()` can only go at the end of a syscall series (if it is used at all)? That would reasonably match how syscalls and callbacks work today. I'm not opposed to extending this in the future should we learn more about how it is used, but maybe starting with a limited version avoids introducing new confusion.
- I'm not sure that "packed syscalls" captures what these are, as all syscalls are packed into very few registers. "Series Syscalls" is one possible name.
        };
        Some(switch_reason)
    } else {
        state.packed_syscall = None;
This must be redundant, correct? Otherwise it is strange to set that there is no packed syscall after having just decided there is no packed syscall.
Yes, sorry about that.
I think you have a point here, but as this counts towards the code size, some platforms might want to exclude it. Having this in the kernel makes that more difficult. On the other hand, we could add a mechanism that checks whether this is possible before the app actually issues such a system call series.
I don't think the allow overhead is really necessary, as the app does not get control back until all the system calls have been executed, and the kernel can actually access the app's memory. I might be wrong, but passing a simple pointer should be fine.
This is how the PR was initially done, but usually a
I agree; your naming is much better.
We discussed this a bit at Tock World. Yield is a serious issue. Its semantics are that it blocks until an upcall is issued. If there are multiple enabled upcalls, it can be that the upcall that arrives is distinct from the one this piece of code is expecting. This is why we have the
In practice, this means that a syscall series can't have a yield in it, unless we add a new yield that supports "yield until this upcall". As a result, a string of syscalls would have to be broken into three parts: the setup (as a series), a
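To make that three-part split concrete, userspace code might look roughly like this (a sketch only; none of these functions are a real libtock API):

```rust
// Hypothetical userspace sketch of the three-part split described above.

struct Call; // stand-in for one encoded syscall entry

fn packed_series(_calls: &[Call]) { /* one kernel transition */ }
fn yield_wait() { /* a plain yield, blocking until an upcall arrives */ }

fn read_sensor() {
    // 1. Setup as a single packed series: allow + subscribe + command.
    packed_series(&[Call, Call, Call]);
    // 2. A plain yield, outside any series, waiting for the upcall.
    yield_wait();
    // 3. Teardown as a second packed series: unsubscribe + unallow.
    packed_series(&[Call, Call]);
}
```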
I think there are three questions to resolve for this to make progress:
Can you explain this lower table? Is this table "the userspace library code uses packed system calls but then other code uses unpacked calls"? Why is there this 1300 byte increase?
@phil-levis I don't have latency numbers outside the context of our project, and I wouldn't think that context switching is very critical. In our case, latency was mostly driven by scheduling the upcall, which depends on the number of active apps. I was mostly focused on the code size overhead required to perform these allow and unallow syscalls.
There is no big difference between filling a struct in memory and loading values into registers. Yes, there may be some savings or losses, but they wouldn't be large: you still need to set the same values somewhere. The major saving is actually moving the checks of each syscall's result into the kernel, so you don't repeat them after every syscall. If we can reduce the number of values needed to configure operations by using a better encoding, this should help with code size. This is why I propose a more fixed-function structure rather than just a raw sequence of syscalls: it saves on building this struct in memory.
So far, performance is only a problem due to the scheduling of syscalls. It can be fixed by changing the logic in the scheduler to prioritize processes with active upcalls.
For us, a sequence of subscribe / command / yield / upcall / unsubscribe is what affects performance. We should either return results in the command syscall, so that command or some specific functions become blocking, or use a variant of yield to retrieve the values of the upcall and avoid subscribe/upcall/unsubscribe.
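As a rough illustration of such a fixed-function structure (all names below are illustrative, not a settled design):

```rust
// A "structured" input describing one compound operation, as opposed to
// a raw list of syscalls: the buffers to allow, the upcall to subscribe
// to, and the command to run, with the unallows/unsubscribes implied.

#[repr(C)]
struct AppBuffer {
    ptr: *const u8,
    len: usize,
}

#[repr(C)]
struct CompoundCall {
    driver_num: u32,         // one driver for the whole compound call
    buffers: [AppBuffer; 2], // buffers to allow (and unallow afterwards)
    subscribe_num: u32,      // upcall to subscribe to, then unsubscribe
    command_num: u32,        // the single command to execute
    command_arg: u32,
}
```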
Can you define performance in this context? I'm not sure what performance metric you mean.
By performance I mean the wall-clock time overhead to complete the sequence of allow/subscribe/command/yield/unsubscribe/unallow vs. the time to actually perform the function. In the case of a command being 'async' in the kernel and the upcall not being executed immediately, this time depends on the behavior of other apps and might include time slices for other apps.
Sure. The code size decrease seems to be due to the fact that there is only one single function that makes a system call (the asm code). These numbers are the code size when an application uses all the standard system calls, and as such has all the possible asm code for each system call. I think that one way of reducing code size is to have a single generic system call function that provides the asm code and use it from thin wrappers. I will try to do that.
I think I agree with @bradjc that this functionality should be in the core kernel. On the other hand, I think it should be generic and only allow packing of system calls, rather than having a semantic structure. As @lschuermann and I suggested, I think having one extra bit would solve the conditional execution.
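A sketch of how that extra bit could work, assuming an illustrative encoding where the flag marks entries that should run even after an earlier failure (useful for cleanup calls such as unallow/unsubscribe):

```rust
// The bit position and encoding here are illustrative, not a settled ABI.

const RUN_ON_ERROR: u32 = 1 << 31;

fn do_syscall(_num: u32) -> bool {
    true // stub: returns whether the call succeeded
}

fn run_packed(entries: &[u32]) {
    let mut failed = false;
    for &entry in entries {
        if failed && (entry & RUN_ON_ERROR) == 0 {
            continue; // skipped because an earlier call failed
        }
        failed |= !do_syscall(entry & !RUN_ON_ERROR);
    }
}
```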
@vsukhoml wrote:
Ah -- by context switch I mean a change in execution context (the stack being used), which is part of an upcall. But it sounds like you are concerned with code size more than time. @alexandruradovici wrote:
I'm open to this idea, but can you describe a situation in which it would be useful? @lschuermann's example of the allow doesn't make sense if a sequence can't include a yield.
I'll start working on this to port it to the kernel.
Following the discussion today at Tockworld, I think this is worth reviving. I believe the availability of a
It would be worth evaluating whether these packed calls would actually save a meaningful (relative to the added complexity) amount of code size or performance after various other low-hanging-fruit optimizations, possible especially with a
Pull Request Overview
This pull request adds the ability to pack several system calls together. It allows an application to issue several system calls while performing one single user space to kernel space transition. This idea is a follow-up from the working group discussion #3064, where @vsukhoml suggested that Tock might benefit from this.
It defines the `0xfe` pseudo system call, which receives three parameters:
It brings only one single modification to the kernel, making public the `is_success` function from `SyscallReturn`.
To avoid frequent switching from user space to kernel space, Tock provides the concept of packed system calls. Most applications will follow a similar pattern when using system calls:

1. allow one or more buffers
2. subscribe to an upcall
3. issue a command
4. yield, waiting for the upcall
5. optionally, unsubscribe
6. optionally, unallow the buffers
By using packed system calls, an application is able to execute one single transition from user space to kernel space by packing items 1, 2, 3 and 4 together, and 5 and 6 together.
While the kernel still executes all system calls, it only performs one single transition from user space to kernel space.
Arguments for the actual system calls are sent using a memory buffer. The application can allocate this buffer anywhere in its writable memory. While this looks like memory sharing between an application and the kernel, it should be safe for the following reasons:

1. The application does not get control back until all the system calls in the pack have been executed.
2. The yield system call can only be used if it is the last system call in the pack.

Each syscall in the pack has an allocated memory frame for its arguments.
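As a hedged illustration of this "one frame per syscall" layout (the exact frame encoding used by the PR may differ; the class numbers below follow Tock 2.0's syscall classes):

```rust
// Hypothetical userspace sketch of building the argument buffer for the
// 0xfe pseudo system call in the application's writable memory.

#[repr(C)]
struct SyscallFrame {
    syscall_class: u32, // Tock 2.0: 1 = subscribe, 2 = command, 3 = rw-allow
    args: [u32; 4],     // arguments; overwritten with the return values
}

fn build_pack() -> [SyscallFrame; 3] {
    [
        SyscallFrame { syscall_class: 3, args: [0; 4] }, // allow a buffer
        SyscallFrame { syscall_class: 1, args: [0; 4] }, // subscribe
        SyscallFrame { syscall_class: 2, args: [0; 4] }, // command
    ]
}

// The application would then pass (pointer to the frames, frame count,
// error policy) as the three parameters of the 0xfe pseudo system call,
// via the usual svc/ecall stub.
```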
Testing Strategy
This pull request was tested by...
TODO or Help Wanted
This pull request still needs...
Documentation Updated
Updated the relevant files in /docs, or no updates are required.
Formatting
Ran make prepush.