
Enable Packed Syscalls #3077

Draft
wants to merge 9 commits into base: master

Conversation

alexandruradovici
Contributor

@alexandruradovici alexandruradovici commented Jul 8, 2022

Pull Request Overview

This pull request adds the ability to pack several system calls together. It allows an application to issue several system calls while performing a single user space to kernel space transition. This idea is a follow-up to the working group discussion in #3064, where @vsukhoml suggested that Tock might benefit from it.

It defines the 0xfe pseudo system call, which receives three parameters:

  • r0 - the number of packed system calls to execute
  • r1 - the address for the system calls' frames
  • r2 - the error policy

It makes only a single modification to the kernel itself: the is_success function of SyscallReturn is made public.

Details

To avoid frequent switching from user space to kernel space, Tock provides the concept of packed system calls. Most applications will follow a similar pattern when using system calls:

  1. allow one or more buffers
  2. subscribe to some events
  3. issue a command
  4. yield
    ---- Optionally
  5. unsubscribe from events
  6. unallow buffers

By using packed system calls, an application can execute items 1, 2, 3, and 4 in one user space to kernel space transition, and items 5 and 6 in another.

While the kernel still executes all of the system calls, it performs only a single transition from user space to kernel space per pack.

Arguments for the actual system calls are passed using a memory buffer. The application can allocate this buffer anywhere in its writable memory. While this looks like memory sharing between an application and the kernel, it should be safe for the following reasons:

  1. The application gets control back only after all the packed system calls have been executed.
  2. The yield system call can only be used if it is the last system call in the pack.
  3. There can only be a single yield within a packed syscall. If several yields were present, processes would lose upcalls.
  4. The upcall triggered by yield will be executed after all the system calls in the pack have been executed.

Each syscall in the pack has an allocated memory frame for its arguments.

                   Argument         Offset (from the pointer)
-----------------+----------------+ 0x00000000
System call 1    | Syscall Number |
                 +----------------+ 0x00000004
                 | r0             |
                 +----------------+ 0x00000008
                 | r1             |
                 +----------------+ 0x0000000c
                 | r2             |
                 +----------------+ 0x00000010
                 | r3             |
-----------------+----------------+ 0x00000014
System call 2    | ....           | ...
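
For illustration only, a minimal Rust sketch of this frame layout and of building a pack for the allow, subscribe, command, yield sequence described above. The names (PackedSyscallFrame, build_pack), the argument ordering, and the syscall class numbers (taken from TRD104) are assumptions made for this sketch, not definitions from this PR.

#[repr(C)]
struct PackedSyscallFrame {
    syscall_number: u32, // syscall class, e.g. 0 = yield, 1 = subscribe, 2 = command, 4 = read-only allow
    r0: u32,             // first argument; overwritten with the call's return value
    r1: u32,
    r2: u32,
    r3: u32,
}

// Hypothetical pack for a single driver: allow ro, subscribe, command, yield-wait.
// r0 of the 0xfe pseudo system call would be 4 and r1 would point at this array.
fn build_pack(driver: u32, buf_addr: u32, buf_len: u32, upcall_addr: u32, userdata: u32) -> [PackedSyscallFrame; 4] {
    [
        PackedSyscallFrame { syscall_number: 4, r0: driver, r1: 0, r2: buf_addr, r3: buf_len },
        PackedSyscallFrame { syscall_number: 1, r0: driver, r1: 0, r2: upcall_addr, r3: userdata },
        PackedSyscallFrame { syscall_number: 2, r0: driver, r1: 1, r2: 0, r3: 0 },
        PackedSyscallFrame { syscall_number: 0, r0: 1, r1: 0, r2: 0, r3: 0 }, // yield-wait
    ]
}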

Testing Strategy

This pull request was tested by...

TODO or Help Wanted

This pull request still needs...

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make prepush.

@github-actions github-actions bot added the kernel label Jul 8, 2022
@alistair23
Contributor

I like this idea! It would be good to add documentation about it and obviously a RISC-V implementation as well :)

@vsukhoml
Contributor

I also like this approach, but thinking about how the call site will look, it seems like there is overhead in preparing the struct. Yes, overall it will be faster, but I'm not so sure about the code size impact.

We were exploring an alternative approach - a command with buffers, where an additional register is used to point to a simpler struct with just the RO and RW allow buffers, so the caller only needs to prepare a struct indicating the number of RO and RW buffers and then store the address/length pairs. Subscribe/unsubscribe can also be added. Since we focus on synchronous execution, the specific sequence is hard-coded: allow ro, allow rw, [virtual subscribe], command, wait until completed, [virtual unsubscribe, getting the result], unallow rw, unallow ro. This way the amount of code required at the call site is minimized, but processing on the kernel side becomes slightly more involved.

A hybrid approach is possible if a more complex, but faster-to-fill, struct of packed syscalls is used.

allow_readonly_count: u8,  // idea is to make it possible to have a single 32-bit write with all the constants
allow_readwrite_count: u8,
subscribe_number: u8,

command_code: usize,
arg0: usize,
arg1: usize,

// Sequence of allow buffers
allow_ro_buf: *const u8,
allow_ro_buf_len: usize, // repeated allow_readonly_count times

allow_rw_buf: *mut u8,
allow_rw_buf_len: usize, // repeated allow_readwrite_count times

This way the number of instructions needed to prepare a call is minimized.

@lschuermann
Member

FWIW, this is pretty much exactly what I had envisioned on the call. IMHO a generic mechanism like this is preferable to a more specific one related to buffers such as what @vsukhoml describes, given that we don't generally enforce any particular buffer sharing semantics w.r.t. a specific capsule call. Hence, the ability to batch-supply system calls seems like a good, easy-to-use facility to get around any performance overheads resulting from frequent context switches.

Moreover, I agree with @alistair23 that this should have a corresponding RISC-V implementation. Also, we should supply return values of issued system calls back to userspace (from my quick glimpse over the current PR state, that does not seem to happen yet?). On that note, we might want to also incorporate a limited error handling mechanism, e.g. interrupt the batch processing as soon as a single error is encountered. When we encode return values and whether a specific call has been processed, userspace can determine the exact nature of the error encountered.

@alexandruradovici
Contributor Author

alexandruradovici commented Jul 11, 2022

FWIW, this is pretty much exactly what I had envisioned on the call. IMHO a generic mechanism like this is preferable to a more specific one related to buffers such as what @vsukhoml describes, given that we don't generally enforce any particular buffer sharing semantics w.r.t. a specific capsule call. Hence, the ability to batch-supply system calls seems like a good, easy-to-use facility to get around any performance overheads resulting from frequent context switches.

Yes, this was exactly your idea. I agree with @lschuermann here. This is also why I do not see any way to support this in a transactional manner, meaning that failed allows and subscribes would be reverted.

Moreover, I agree with @alistair23 that this should have a corresponding RISC-V implementation. Also, we should supply return values of issued system calls back to userspace (from my quick glimpse over the current PR state, that does not seem to happen yet?). On that note, we might want to also incorporate a limited error handling mechanism, e.g. interrupt the batch processing as soon as a single error is encountered. When we encode return values and whether a specific call has been processed, userspace can determine the exact nature of the error encountered.

Return values are already being sent to the application by replacing the system call arguments in the system call frame. The packed system call receives a third argument representing the execution error policy:

  • 1 means that the batch system call will be fully executed, regardless of whether errors are encountered
  • any other value means stop after the first system call that returns an error

The batch returns either:

  • Success if all system calls have been successfully executed (or the error policy is to continue)
  • Failure U32 stating how many system calls out of the batch have not been executed
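
A small sketch (with a hypothetical helper name) of how userspace could interpret this return value, assuming the Failure U32 payload is the number of calls that were not executed, as described above:

// total: number of syscalls in the batch; Err(n): n calls were not executed.
fn calls_executed(total: usize, result: Result<(), u32>) -> usize {
    match result {
        Ok(()) => total,
        Err(not_executed) => total - not_executed as usize,
    }
}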

/// This decides what happens when one of the syscalls
/// within a packed system call fails.
enum PackedSyscallErrorPolicy {
    /// Stop executing the syscall pack and return the
    /// error to the application.
    /// This is the default behaviour.
    STOP,
    /// Continue executing the rest of the syscalls until
    /// all the syscalls in the pack are fully executed.
    CONTINUE,
}
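
A minimal sketch, not taken from the PR diff, of how the third (r2) argument described above could map onto this policy:

impl PackedSyscallErrorPolicy {
    /// 1 selects CONTINUE (the batch is fully executed regardless of errors);
    /// any other value selects STOP, the default.
    fn from_register(r2: usize) -> Self {
        match r2 {
            1 => PackedSyscallErrorPolicy::CONTINUE,
            _ => PackedSyscallErrorPolicy::STOP,
        }
    }
}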

@lschuermann
Member

Return values are already being sent to the application by replacing the system call arguments in the system call frame.

Ah, sorry, I missed that! This sounds pretty good already.

@lschuermann lschuermann reopened this Jul 11, 2022
@lschuermann
Member

Sorry, that was unintended. These buttons are way too close to each other. 😅

@github-actions github-actions bot added the risc-v RISC-V architecture label Jul 11, 2022
@alexandruradovici
Contributor Author

I added the RISC-V port; @alistair23, I would love some feedback.

@hudson-ayers
Contributor

@alexandruradovici do you have any userspace code that targets this system call, even rough code? If so, I would love to take a look at how that looks. This currently seems to add 800-900 bytes of size to the kernel, so I am curious how much use of these packed system calls is needed to make up for that increase. I understand the goal of packed system calls was both performance and code size, but I am curious to see how much it actually helps with size in practice.

@lschuermann
Member

FWIW, #3080 could be of interest here. I was reminded of #2582 while reviewing this and noticing your use of volatile_ accesses. Long story short, you should not (be required to) use volatile in syscall.rs on either ARM or RISC-V, although on ARM the situation is slightly less clear and generally more confusing than on RISC-V. #3080 attempts to fix this and other issues.

@alistair23
Contributor

@alexandruradovici I had a quick look and it looks good. Like Hudson said, if there is a userspace implementation I can test and have a closer look at, that would be great!

@alexandruradovici
Contributor Author

@alexandruradovici I had a quick look and it looks good. Like Hudson said, if there is a userspace implementation I can test and have a closer look at, that would be great!

I pushed an example: tock/libtock-c#284

@alexandruradovici
Contributor Author

I tried to improve the code size for RISC-V. I deleted the debug! message and some redundant checks; the code size difference is now around 640 bytes.

Interestingly enough, there now seems to be no code size difference for ARMv7 (microbit). I find that somewhat hard to believe.

@alexandruradovici
Contributor Author

FWIW, #3080 could be of interest here. I was reminded of #2582 while reviewing this and noticing your use of volatile_ accesses. Long story short, you should not (be required to) use volatile in syscall.rs on either ARM or RISC-V, although on ARM the situation is slightly less clear and generally more confusing than on RISC-V. #3080 attempts to fix this and other issues.

I'm not sure that this is the point, but using normal reads and writes instead of volatile_ makes no difference in code size.

@vsukhoml
Contributor

To be useful, this change should also result in code size savings. The benefits of avoiding many syscalls lie only in avoiding context saving/restoring, which is relatively cheap, and in potential scheduling benefits (though we need to check whether a long sequence of syscalls will 'deny' service to other apps for a prolonged time), which are less valuable.
However, with the call sites (either inlined or in libtock-rs), it is unclear that this implementation will bring any savings at all. Filling the struct requires instructions, there are more fields to fill in, etc. We need to take a baseline - say, some function with 3 allow buffers - compare the current approach with the proposed one, and understand the breakdown of the size differences.

@alexandruradovici
Contributor Author

alexandruradovici commented Jul 13, 2022

I see your point, @vsukhoml, but wouldn't a complex command system call (allow buffer list + subscribe + command) also require filling out a longer struct (pointer + size x n)?

I'll do some performance tests and get back to you.

@alexandruradovici
Contributor Author

alexandruradovici commented Jul 13, 2022

I did some performance tests. I ran sets of 25, 50 and 100 command system calls to the gpio driver and led driver.

The test bench works in the following way:

  • set a gpio pin
  • run the system calls
  • clear a gpio pin

I measured the time the gpio is high using an oscilloscope connected to the gpio pin.

The test bench source code is in the packed application.

The main findings seem to be the following:

  • the kernel code overhead is around 600 bytes
  • the application code overhead seems to be around 12 bytes (microbit) and 20 bytes (esp32)
  • the performance improvement seems to be between 1us - 2us per system call

I tried combining the system call number and the driver number into a single uint32_t, but this added an extra 8 bytes to the compiled binary.

microbit v2

syscalls                 | 25            | 50                | 100
gpio (packed/sequential) | 690us / 740us | 1.340ms / 1.460ms | 2.640ms / 2.960ms
led (packed/sequential)  | 720us / 760us | 1.440ms / 1.520ms | 2.800ms / 3.040ms

esp32c3-devkitm-1

syscalls                 | 25            | 50            | 100
gpio (packed/sequential) | 152us / 188us | 296us / 360us | 580us / 736us
led (packed/sequential)  | 136us / 170us | 256us / 336us | 450us / 660us

@lschuermann
Member

Thank you for these measurements @alexandruradovici! I do believe that these numbers are somewhat conservative though. @vsukhoml's suggestion was specifically to pack different system calls into a single invocation, which might increase our code-size savings, assuming the code for marshalling and unmarshalling can be made sufficiently efficient (possibly assembly?). That being said, I think the processing overhead already seems reasonably good. As discussed with you today, I can try to get some accurate cycle numbers on RISC-V and test a case combining allow, subscribe, command and unallows in a single invocation through the litex_sim built with RVFI tracer support.

@vsukhoml
Contributor

@alexandruradovici thanks for the measurements! It is good if the code bloat remains small. There are a few things which make me think the code size at the call site can grow: filling a memory struct requires more instructions than just shuffling values in registers, you also need to write additional constants, the compiler may have to spill registers into memory to free registers, and the code size impact depends on the ISA and the use of registers in the surrounding code. And although it can be a chicken-and-egg problem, doesn't this struct require its own allow? Semantically this packed syscall is an allow with the provided struct plus a command which interprets the buffer in a specific way. On RISC-V, using some unused registers like t0-t6 for the pointer to that struct would be more efficient, as these registers don't have to be preserved by the compiler, so no extra instructions. I suspect the not-so-big size impact comes from the collapsed error processing policy, which is a nice idea - instead of a check after each syscall, it is just one for all.
Another side effect of using memory is that we are no longer limited to just 2 arguments per command, and we can also pack the driver/function/syscall class into a single 32-bit value which can be unpacked before invoking the kernel functionality. This should reduce code size even more.
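
As a purely illustrative sketch of that last idea (the field widths here are my assumption; nothing in this thread fixes them), the class, driver, and command numbers could be packed and unpacked like this:

fn pack_ids(class: u32, driver: u32, command: u32) -> u32 {
    // assumed split: 8 bits for the syscall class, 12 bits each for driver and command
    (class & 0xff) | ((driver & 0xfff) << 8) | ((command & 0xfff) << 20)
}

fn unpack_ids(word: u32) -> (u32, u32, u32) {
    (word & 0xff, (word >> 8) & 0xfff, (word >> 20) & 0xfff)
}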

@lschuermann yes, RISC-V data is of particular interest.

@alexandruradovici
Contributor Author

alexandruradovici commented Jul 14, 2022

There are a few things which make me think the code size at the call site can grow: filling a memory struct requires more instructions than just shuffling values in registers, you also need to write additional constants, the compiler may have to spill registers into memory to free registers, and the code size impact depends on the ISA and the use of registers in the surrounding code. And although it can be a chicken-and-egg problem, doesn't this struct require its own allow? Semantically this packed syscall is an allow with the provided struct plus a command which interprets the buffer in a specific way.

An allow is not necessary, as the memory is used directly by the low-level part of the kernel and never reaches a capsule. This should be safe, as the process does not get control back until the packed system call is fully executed.

On RISC-V, using some unused registers like t0-t6 for the pointer to that struct would be more efficient, as these registers don't have to be preserved by the compiler, so no extra instructions.

I think this can be done on a case-by-case basis, but it is not a generic solution. Given the small speed improvement (1us - 2us / system call), the packed system call makes sense for longer packs of system calls.

I suspect the not-so-big size impact comes from the collapsed error processing policy, which is a nice idea - instead of a check after each syscall, it is just one for all.

I'm not sure I understand your point here.

Another side effect of using memory is that we are no longer limited to just 2 arguments per command, and we can also pack the driver/function/syscall class into a single 32-bit value which can be unpacked before invoking the kernel functionality. This should reduce code size even more.

I tried packing the system call number and command number into a single 32-bit word, but code size increased by 8 bytes.

@alexandruradovici
Contributor Author

alexandruradovici commented Jul 14, 2022

I made another test application that uses all the system calls. The driver used receives allows, a subscribe, and a command.

It seems that using packed system calls saves bytes. This is more or less what @lschuermann and I expected when we discussed this.

EDIT: the size decrease is most probably due to the fact that the allow_ and subscribe functions are not included in the binary. As soon as I use these functions in the binary, the size with packed system calls increases by around 40 bytes.

           | microbit | esp32
packed     | 1212 B   | 1388 B
sequential | 1284 B   | 1464 B

When using unpacked system calls (by using printf), the code sizes are:

           | microbit | esp32
packed     | 5508 B   | 5440 B
sequential | 6808 B   | 6700 B

@phil-levis
Contributor

I'd like to suggest that the compound should terminate early if a syscall fails; the return result can indicate which one it failed on (the number of succeeding calls). vNFS (https://www.usenix.org/conference/fast17/technical-sessions/presentation/chen) is a similar approach used in NFS, and I think there are a lot of similarities we can learn from.

@lschuermann
Member

I'd like to suggest that the compound should terminate early if a syscall fails; the return result can indicate which one it failed

I see convincing arguments for both this strategy, and a strategy to follow through with processing the system call batch in spite of an error. While stopping after the first error and reporting the individual call's success or failure values gives userspace flexibility in handling error cases however it likes, having the option to execute subsequent calls after an error allows us to build thin system call wrappers in userspace which are much less complex for the common case. Take the example of an application printing something to the console: the app will perform the following operations: allow, subscribe, command, yield, (un)allow. If the command system call were to fail, the final (un)allow can still execute and userspace does not have to worry about continuing the call execution path where the kernel errored. Perhaps we can mark on a per-call basis (encoded as a single bit) whether errors are tolerated and reported, or are grounds to abort the batch processing? Generally though, I do like the approach proposed here.
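
One possible encoding of that per-call bit, sketched purely as an illustration (the flag position is an assumption, not part of this PR): reserve the top bit of each frame's syscall-number word as a "tolerate errors" flag.

const TOLERATE_ERROR_FLAG: u32 = 1 << 31;

// Batch processing stops only when a call fails and its frame does not
// carry the tolerate-errors flag.
fn should_abort(frame_syscall_word: u32, call_failed: bool) -> bool {
    call_failed && (frame_syscall_word & TOLERATE_ERROR_FLAG) == 0
}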

@a-pronin
Contributor

I see convincing arguments for both this strategy, and a strategy to follow through with processing the system call batch in spite of an error.

Another possible reason to have a 'structured' input (list of buffers to allow & events to subscribe + command) rather than a list of syscalls: it is easier to decide what to do to properly finalize the result of the compound call (unallows/unsubscribes still need to be called).

Contributor

@bradjc bradjc left a comment


Given the standard subscribe-allow-command-yield flow, I think it is appealing to extend the tock syscall model to avoid essentially redundant context switches.

  1. It's cool that this can be done without modifying the core kernel. However, the architecture itself is not what is enabling this optimization, the design of Tock is. As such, I think this should be in the core kernel and not left as a per-arch implementation-specific feature. Writing cross-platform apps gets more difficult if architectures can choose to include this feature or not.

    Following on that, we already have a mechanism to share memory between an app and the kernel, so why is the series of syscalls not stored in an allowed buffer?

  2. The subtleties around yield() are somewhat concerning. The concurrency model for apps is already tricky and confusing for new Tock users, and I'm worried that adding a new caveat for how yield works and where it can be used only adds to the complexity.

    What about a limitation that yield() can only go at the end of a syscall series (if it is used at all)? That would reasonably match how syscalls and callbacks work today.

    I'm not opposed to extending this in the future should we learn more about how it is used, but maybe starting with a limited version avoids introducing new confusion.

  3. I'm not sure that packed syscalls captures what these are, as all syscalls are packed into very few registers. "Series Syscalls" is one possible name.

};
Some(switch_reason)
} else {
state.packed_syscall = None;
Contributor


This must be redundant, correct? Otherwise it is strange to set there is no packed syscall after deciding there is no packed syscall.

Contributor Author


Yes, sorry about that.

@alexandruradovici
Contributor Author

1. It's cool that this can be done without modifying the core kernel. However, the architecture itself is not what is enabling this optimization, the design of Tock is. As such, I think this should be in the core kernel and not left as a per-arch implementation-specific feature. Writing cross-platform apps gets more difficult if architectures can choose to include this feature or not.

I think you have a point here, but as this counts towards the code size, some platforms might want to exclude it, and having it in the kernel makes that more difficult. On the other hand, we can add a mechanism that would check whether this is possible before the app actually issues such a system call series.

   Following on that, we already have a mechanism to share memory between an app and the kernel, so why is the series of syscalls not stored in an `allow`ed buffer?

I don't think the allow overhead is really necessary, as the app does not get back control until all the system calls have been executed and the kernel can actually access the app's memory. I might be wrong, but passing a simple pointer should be fine.

2. The subtleties around `yield()` are somewhat concerning. The concurrency model for apps is already tricky and confusing for new Tock users, and I'm worried that adding a new caveat for how yield works and where it can be used only adds to the complexity.
   What about a limitation that `yield()` can only go at the end of a syscall series (if it is used at all)? That would reasonably match how syscalls and callbacks work today.

This is how the PR was initially done, but usually a yield is followed by at least one unallow.

3. I'm not sure that packed syscalls captures what these are, as all syscalls are packed into very few registers. "Series Syscalls" is one possible name.

I agree; your naming is much better.

@phil-levis
Contributor

We discussed this a bit at Tock World. Yield is a serious issue. Its semantics are that it blocks until an upcall is issued. If there are multiple enabled upcalls it can be that the upcall is distinct from the one this piece of code is expecting. This is why we have the yield_for helper function in libtock-c; it allows a caller to repeatedly call yield until a particular boolean flag is true (a particular upcall was invoked). Because libraries might handle arbitrary upcalls, a caller can't make any assumptions about whether a call to yield will resume on a particular upcall.

In practice, this means that a syscall series can't have a yield in it, unless we add a new yield that supports "yield until this upcall".

As a result, a string of syscalls would have to be broken into three parts: the setup (as a series), a yield_for, then the teardown (as a series).

@phil-levis phil-levis self-assigned this Jul 29, 2022
@phil-levis
Contributor

I think there are three questions to resolve for this to make progress:

  1. Quantifying the exact cost these syscall patterns have and their impact on applications. @vsukhoml has mentioned some crypto use cases that involve ~5 allows on each side of the yield. What latency do these calls impose, and how much overhead does that impose on the total operation? @vsukhoml it would be great if you could provide some numbers here. I think there were also mentions of code size -- quantifying this in contrast to something with composite calls is also necessary. The intuition of many at TockWorld was that the code to construct the composite would be just as long; this intuition might be wrong, however, and if it is, it would be great to have numbers supporting that and correcting it.
  2. Assuming that 1) shows there is a significant performance problem, then we need to determine what composites are allowed, particularly due to the limitations of yield I noted above.
  3. If 2) greatly reduces the benefits of composite calls with respect to the issues raised in 1), then we should explore whether this necessitates either a new version of yield or a new blocking command, as discussed at TockWorld.

@phil-levis
Contributor

I made another test application that uses all the system calls. The driver used receives allows, a subscribe, and a command.

It seems that using packed system calls saves bytes. This is more or less what @lschuermann and I expected when we discussed this.

EDIT: the size decrease is most probably due to the fact that the allow_ and subscribe functions are not included in the binary. As soon as I use these functions in the binary, the size with packed system calls increases by around 40 bytes.

           | microbit | esp32
packed     | 1212 B   | 1388 B
sequential | 1284 B   | 1464 B

When using unpacked system calls (by using printf), the code sizes are:

           | microbit | esp32
packed     | 5508 B   | 5440 B
sequential | 6808 B   | 6700 B

Can you explain this lower table? Is this table "the userspace library code uses packed system calls but then other code uses unpacked calls"? Why is there this 1300 byte increase?

@vsukhoml
Contributor

vsukhoml commented Aug 2, 2022

  1. Quantifying the exact cost these syscall patterns have and their impact on applications. @vsukhoml has mentioned some crypto use cases that involve ~5 allows on each side of the yield. What latency do these calls impose, and how much overhead does that impose on the total operation? @vsukhoml it would be great if you could provide some numbers here.

@phil-levis I don't have latency numbers outside of the context of our project, and I don't think that context switching is very critical. In our case latency was mostly driven by scheduling the upcall, which depends on the number of active apps. I was mostly focused on the code size overhead required to perform these allow and unallow syscalls.

I think there were also mentions of code size -- quantifying this in contrast to something with composite calls is also necessary. The intuition of many at TockWorld was that the code to construct the composite would be just as long; this intuition might be wrong, however, and if it is, it would be great to have numbers supporting that and correcting it.

There is not a big difference between filling a struct in memory and loading values into registers. Yes, there may be some savings or losses, but they wouldn't be large - you still need to set the same values somewhere. The major saving is actually moving the checks of the syscall result into the kernel, so you don't repeat them after each syscall. If we can reduce the number of values needed to configure operations by using a better encoding, this should help with code size. This is why I propose a more fixed-function structure rather than just a raw sequence of syscalls - it saves on building this struct in memory.

  2. Assuming that 1) shows there is a significant performance problem, then we need to determine what composites are allowed, particularly due to the limitations of yield I noted above.

So far, performance is only a problem due to the scheduling of syscalls. It can be fixed by changing the logic in the scheduler - prioritize processes with active upcalls.

  3. If 2) greatly reduces the benefits of composite calls with respect to the issues raised in 1), then we should explore whether this necessitates either a new version of yield or a new blocking command, as discussed at TockWorld.

For us, a sequence of subscribe / command / yield / upcall / unsubscribe is what affects performance. We should either return results in the command syscall, making command or some specific functions blocking, or use a variant of yield to retrieve the values of the upcall and avoid subscribe/upcall/unsubscribe.

@phil-levis
Contributor

phil-levis commented Aug 3, 2022

So far, performance is only a problem due to the scheduling of syscalls. It can be fixed by changing the logic in the scheduler - prioritize processes with active upcalls.

  3. If 2) greatly reduces the benefits of composite calls with respect to the issues raised in 1), then we should explore whether this necessitates either a new version of yield or a new blocking command, as discussed at TockWorld.

For us, a sequence of subscribe / command / yield / upcall / unsubscribe is what affects performance. We should either return results in the command syscall, making command or some specific functions blocking, or use a variant of yield to retrieve the values of the upcall and avoid subscribe/upcall/unsubscribe.

Can you define performance in this context? I'm not sure what performance metric you mean.

@vsukhoml
Contributor

vsukhoml commented Aug 3, 2022

By performance I mean the wall-clock time overhead to complete the sequence of allow/subscribe/command/yield/unsubscribe/unallow vs. the time to actually perform a function. In the case of command being 'async' in the kernel and the upcall not being executed immediately, this time depends on the behavior of other apps and might include time slices for other apps.

@alexandruradovici
Contributor Author

Can you explain this lower table? Is this table "the userspace library code uses packed system calls but then other code uses unpacked calls"? Why is there this 1300 byte increase?

Sure. The code size decrease seems to be due to the fact that there is only a single function that makes a system call (the asm code). These numbers are the code size when an application uses all the standard system calls, and as such has all the possible asm code for each system call. I think one way of reducing code size is to have a single generic system call function that provides the asm code and use it from thin wrappers. I will try to do that.

@alexandruradovici
Contributor Author

I think I agree with @bradjc that this functionality should be in the core kernel. On the other hand I think it should be generic and only allow packing of system calls, rather than having a semantic structure. As @lschuermann and I suggested, I think having one extra bit would solve the conditional execution.

@phil-levis
Contributor

phil-levis commented Sep 20, 2022

@vsukhoml wrote:

@phil-levis I don't have latency numbers outside of the context of our project, and I don't think that context switching is very critical. In our case latency was mostly driven by scheduling the upcall, which depends on the number of active apps. I was mostly focused on the code size overhead required to perform these allow and unallow syscalls.

Ah -- by context switch I mean a change in execution context (the stack being used), which is part of an upcall. But it sounds like you are concerned with code size more than time.

@alexandruradovici wrote:

I think I agree with @bradjc that this functionality should be in the core kernel. On the other hand I think it should be generic and only allow packing of system calls, rather than having a semantic structure. As @lschuermann and I suggested, I think having one extra bit would solve the conditional execution.

I'm open to this idea, but can you describe a situation in which it would be useful? @lschuermann's example of the allow doesn't make sense if a sequence can't include a yield.

@alexandruradovici
Contributor Author

I'll start working on porting this to the kernel.

@alevy
Member

alevy commented Jul 28, 2023

Following the discussion today at Tockworld, I think this is worth reviving.

I believe the availability of a yield_for in the kernel (as @ppannuto suggested in person today) would ameliorate the remaining concerns about the usability of this.

It would be worth evaluating whether these packed calls would actually save a meaningful (relative to the added complexity) amount of code size or performance after various other low-hanging-fruit optimizations, which become possible especially with a yield_for in the kernel. But especially given that this is already prototyped, that should be a relatively quick evaluation.
