Hey @simaocat, the function in question just invokes `io_uring_submit_and_wait`:

cachegrand/src/support/io_uring/io_uring_support.c, lines 436 to 451 in eb2f796

The entire purpose of `io_uring_submit_and_wait` is to submit, in a single syscall, every SQE queued up to that point and then wait for completions. This means that the submission is already batched. The number you are seeing, the "amount" (the `wait_nr` argument), can't really be changed, as it would tell the kernel to wait for more than 1 event, e.g. for at least 2 TCP/IP or 2 timer events. However, this also means that when the value is set to 1 and there are 10 events pending, they will all be returned, not just one, therefore returning batches of responses as well. The system is already "batching" the submissions and the processing of the responses: cachegrand does only 1 syscall every time the scheduler processes all the events (it's not entirely true but not relevant, more below).
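For reference, this is roughly the pattern described above in plain liburing (a minimal sketch of mine, not cachegrand's actual event loop):

```c
#include <liburing.h>

// Minimal sketch (plain liburing, not cachegrand's code): submit every
// queued SQE with one syscall, wait for at least one completion, then
// drain every completion that is already available, not just one.
static void event_loop_pass(struct io_uring *ring) {
    // Submits all pending SQEs and blocks until >= 1 CQE is ready
    io_uring_submit_and_wait(ring, 1);

    unsigned head;
    unsigned count = 0;
    struct io_uring_cqe *cqe;

    // Walk every CQE sitting in the ring: if 10 events completed,
    // all 10 are processed here without any further syscall
    io_uring_for_each_cqe(ring, head, cqe) {
        // ... dispatch cqe->res / io_uring_cqe_get_data(cqe) here ...
        count++;
    }

    // Mark the drained CQEs as consumed in one go
    io_uring_cq_advance(ring, count);
}
```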
A potential improvement is instead to stop "batching" the submissions and submit the single operations immediately using `io_uring_submit` (which, in liburing, is just a wrapper around the normal submit-and-wait that doesn't wait), ideally combined with `IORING_SETUP_SQPOLL`. Potentially it might still cause the user space to have to stop and enter kernel mode, BUT only if the ring goes to sleep, which under sustained usage wouldn't happen. This approach might also open the door to not having to wait altogether, as there are some bits implemented by io_uring (now stable) that allow it, but it's something I need to explore as it requires polling the ring.

However, to be honest, I don't expect this to be a game changer in terms of performance or latency, but it might bring some light benefits when you have fewer clients connected (more clients -> fewer syscalls per second, fewer clients -> more syscalls per second).
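For reference, this is a minimal sketch of mine (not code from cachegrand) of what the `IORING_SETUP_SQPOLL` setup looks like:

```c
#include <string.h>
#include <liburing.h>

// Sketch only: create a ring with kernel-side submission polling.
// With IORING_SETUP_SQPOLL a kernel thread picks up new SQEs on its
// own, so io_uring_submit() normally doesn't enter the kernel at all;
// liburing only issues the syscall to wake the thread up if it went
// to sleep after being idle for sq_thread_idle milliseconds.
static int ring_init_sqpoll(struct io_uring *ring, unsigned entries) {
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000; // ms of idle before the SQ thread sleeps

    return io_uring_queue_init_params(entries, ring, &params);
}
```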
The flamegraph you are seeing is perfectly normal (and beautiful from a performance point of view): what you are seeing is not that the syscall takes time, but that the TCP/IP and timer operations take time. Which version did you stress test, the latest release or main?

About the hugepages: do you have a comparison between hugepages enabled and hugepages disabled? I'm not sure you can see an immediate benefit using hugepages, you might actually see worse performance initially: allocating a hugepage takes longer than allocating a normal page, so until mimalloc "stabilizes", which might take a while depending on the kind of stress test, mimalloc might be allocating new pages that the kernel has to zero before returning them (which makes everything slow, and that's why every memory allocator on earth caches memory pages fetched with mmap).

NOTE: keep in mind that memtier_benchmark, by default, sends one request per connection at a time and waits for the reply before sending the next one. The client latency is therefore killed by the client behaviour itself; even improving this aspect would result in an extremely limited benefit.
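If you want to see the hugepage cost mentioned above in isolation, a standalone sketch like this one (mine, unrelated to mimalloc's internals; it assumes hugepages have already been reserved, e.g. with sysctl vm.nr_hugepages=64) shows the first-touch penalty:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

// Standalone sketch: time the mapping and first touch of a 2 MiB
// hugepage. The kernel has to zero the whole page before handing it
// out, which is exactly the cost allocators hide by caching the pages
// they already fetched with mmap.
int main(void) {
    const size_t size = 2 * 1024 * 1024; // one 2 MiB hugepage

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    void *page = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB) failed - are hugepages reserved?");
        return 1;
    }
    memset(page, 0xAA, size); // force the page to actually be faulted in

    clock_gettime(CLOCK_MONOTONIC, &end);
    long usec = (end.tv_sec - start.tv_sec) * 1000000L
              + (end.tv_nsec - start.tv_nsec) / 1000L;
    printf("mmap + first touch of a 2 MiB hugepage: %ld us\n", usec);

    munmap(page, size);
    return 0;
}
```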
@simaocat, if you want to do a benchmark test to see how things will behave using `io_uring_submit`, the submission has to be done as soon as the operations are queued, basically right before any of the points where the code currently just enqueues the SQE and returns. The changes must be done off of commit 7c57026 (which is the branch 0.4.0); don't use main as a base, as there are plenty of changes there (the line numbers shouldn't be different between main and the commit id I shared).
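To make the idea concrete, here is a sketch of the two strategies in plain liburing; the helper names are hypothetical, they are not the actual cachegrand functions:

```c
#include <liburing.h>

// Illustration only, with hypothetical helper names (not the actual
// cachegrand functions): batching the submissions vs submitting each
// operation immediately.

// Current behaviour: SQEs pile up and a single submit-and-wait
// happens later, once per scheduler pass
static void enqueue_batched(struct io_uring *ring, int fd,
                            void *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL) {
        return; // ring full: a submit would be needed first
    }
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    // no submit here: io_uring_submit_and_wait() runs later
}

// Experiment: push the SQE to the kernel right away; combined with
// IORING_SETUP_SQPOLL this is usually just a tail update, not a syscall
static void enqueue_immediate(struct io_uring *ring, int fd,
                              void *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL) {
        return; // ring full: a submit would be needed first
    }
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    io_uring_submit(ring); // submit now, don't wait for completions
}
```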
Recently I conducted a stress test of cachegrand using memtier_benchmark on Ubuntu 22.04 with kernel 5.15 and with hugepages enabled. I used a perf flame graph to observe the performance bottlenecks, as shown in the figure below: the main overhead was in memory allocation and in the io_uring_submit_and_wait function. From the code in worker_iouring.c I saw the call io_uring_support_sqe_submit_and_wait(context->ring, 1), so the wait_nr passed in is always 1. Is it possible to change this to batch submission and batch waiting? After I locally changed wait_nr to a value greater than 1, the stress test reported an error. So I would like to ask whether this value can be raised above 1 in order to reduce the number of system calls through batched waiting.