Hey @simaocat, the function in question just invokes `io_uring_submit_and_wait`:

cachegrand/src/support/io_uring/io_uring_support.c, lines 436 to 451 in eb2f796

The entire purpose of `io_uring_submit_and_wait` is to submit, in a single syscall, every SQE queued up to that point and then wait for completions. This means that the submission is already batched. The number you are seeing, the "amount" (the `wait_nr` argument), can't really be changed, as it would tell the kernel to wait for more than 1 event, e.g. for at least 2 TCP/IP or 2 timer events. However, this also means that when the value is set to 1 and there are 10 events pending, they will all be returned, not just one, therefore returning batches of responses as well. The system is already "batching" the submissions and the processing of the responses: cachegrand does only 1 syscall every time the scheduler processes all the events (it's not entirely true but not relevant, more below).
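For reference, this is roughly the pattern described above in plain liburing (a minimal sketch of mine, not cachegrand's actual event loop):

```c
#include <liburing.h>

// Minimal sketch (plain liburing, not cachegrand's code): submit every
// queued SQE with one syscall, wait for at least one completion, then
// drain every completion that is already available, not just one.
static void event_loop_pass(struct io_uring *ring) {
    // Submits all pending SQEs and blocks until >= 1 CQE is ready
    io_uring_submit_and_wait(ring, 1);

    unsigned head;
    unsigned count = 0;
    struct io_uring_cqe *cqe;

    // Walk every CQE sitting in the ring: if 10 events completed,
    // all 10 are processed here without any further syscall
    io_uring_for_each_cqe(ring, head, cqe) {
        // ... dispatch cqe->res / io_uring_cqe_get_data(cqe) here ...
        count++;
    }

    // Mark the drained CQEs as consumed in one go
    io_uring_cq_advance(ring, count);
}
```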
A potential improvement is instead to stop "batching" the submissions and submit the single operations immediately using `io_uring_submit` (which, in liburing, is just a wrapper around the normal submit-and-wait that doesn't wait), ideally combined with `IORING_SETUP_SQPOLL`. Potentially it might still cause the user space to have to stop and enter kernel mode, BUT only if the ring goes to sleep, which under sustained usage wouldn't happen. This approach might also open the door to not having to wait altogether, as there are some bits implemented by io_uring (now stable) that allow it, but it's something I need to explore as it requires polling the ring.

However, to be honest, I don't expect this to be a game changer in terms of performance or latency, but it might bring some light benefits when you have fewer clients connected (more clients -> fewer syscalls per second, fewer clients -> more syscalls per second).
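For reference, this is a minimal sketch of mine (not code from cachegrand) of what the `IORING_SETUP_SQPOLL` setup looks like:

```c
#include <string.h>
#include <liburing.h>

// Sketch only: create a ring with kernel-side submission polling.
// With IORING_SETUP_SQPOLL a kernel thread picks up new SQEs on its
// own, so io_uring_submit() normally doesn't enter the kernel at all;
// liburing only issues the syscall to wake the thread up if it went
// to sleep after being idle for sq_thread_idle milliseconds.
static int ring_init_sqpoll(struct io_uring *ring, unsigned entries) {
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000; // ms of idle before the SQ thread sleeps

    return io_uring_queue_init_params(entries, ring, &params);
}
```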
The flamegraph you are seeing is perfectly normal (and beautiful from a performance point of view): what you are seeing is not that the syscall takes time, but that the TCP/IP and timer operations take time. Which version did you stress test, the latest release or main?

About the hugepages: do you have a comparison between hugepages enabled and hugepages disabled? I'm not sure you can see an immediate benefit using hugepages, you might actually see worse performance initially: allocating a hugepage takes longer than allocating a normal page, so until mimalloc "stabilizes", which might take a while depending on the kind of stress test, mimalloc might be allocating new pages that the kernel has to zero before returning them (which makes everything slow, and that's why every memory allocator on earth caches memory pages fetched with mmap).

NOTE: keep in mind that memtier_benchmark, by default, sends one request per connection at a time and waits for the reply before sending the next one. The client latency is therefore killed by the client behaviour itself; even improving this aspect would result in an extremely limited benefit.
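If you want to see the hugepage cost mentioned above in isolation, a standalone sketch like this one (mine, unrelated to mimalloc's internals; it assumes hugepages have already been reserved, e.g. with sysctl vm.nr_hugepages=64) shows the first-touch penalty:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

// Standalone sketch: time the mapping and first touch of a 2 MiB
// hugepage. The kernel has to zero the whole page before handing it
// out, which is exactly the cost allocators hide by caching the pages
// they already fetched with mmap.
int main(void) {
    const size_t size = 2 * 1024 * 1024; // one 2 MiB hugepage

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    void *page = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB) failed - are hugepages reserved?");
        return 1;
    }
    memset(page, 0xAA, size); // force the page to actually be faulted in

    clock_gettime(CLOCK_MONOTONIC, &end);
    long usec = (end.tv_sec - start.tv_sec) * 1000000L
              + (end.tv_nsec - start.tv_nsec) / 1000L;
    printf("mmap + first touch of a 2 MiB hugepage: %ld us\n", usec);

    munmap(page, size);
    return 0;
}
```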
@simaocat, if you want to do a benchmark test to see how things will behave using `io_uring_submit`, the submission has to be done as soon as the operations are queued, basically right before any of the points where the code currently just enqueues the SQE and returns. The changes must be done off of commit 7c57026 (which is the branch 0.4.0); don't use main as a base, as there are plenty of changes there (the line numbers shouldn't be different between main and the commit id I shared).
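To make the idea concrete, here is a sketch of the two strategies in plain liburing; the helper names are hypothetical, they are not the actual cachegrand functions:

```c
#include <liburing.h>

// Illustration only, with hypothetical helper names (not the actual
// cachegrand functions): batching the submissions vs submitting each
// operation immediately.

// Current behaviour: SQEs pile up and a single submit-and-wait
// happens later, once per scheduler pass
static void enqueue_batched(struct io_uring *ring, int fd,
                            void *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL) {
        return; // ring full: a submit would be needed first
    }
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    // no submit here: io_uring_submit_and_wait() runs later
}

// Experiment: push the SQE to the kernel right away; combined with
// IORING_SETUP_SQPOLL this is usually just a tail update, not a syscall
static void enqueue_immediate(struct io_uring *ring, int fd,
                              void *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL) {
        return; // ring full: a submit would be needed first
    }
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    io_uring_submit(ring); // submit now, don't wait for completions
}
```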
Recently I conducted a stress test of cachegrand using memtier_benchmark on Ubuntu 22.04 with kernel 5.15 and with hugepages enabled. I used a perf flame graph to observe the performance bottlenecks, as shown in the figure below: the main overhead was in memory allocation and in the io_uring_submit_and_wait function. From the code in worker_iouring.c I saw the call io_uring_support_sqe_submit_and_wait(context->ring, 1), so the wait_nr passed in is always 1. Is it possible to change this to batch submission and batch waiting? After I locally changed wait_nr to a value greater than 1, the stress test reported an error. So I would like to ask whether this value can be raised above 1 in order to reduce the number of system calls through batched waiting.