The benchmark is invalid #3
Comments
Hi Ben, Thanks for your comments! Please find my replies below:
Please note that the thread count spans from 1 to 24 in our experiments, which covers not only "more threads than CPU cores" but also fewer threads than cores and an equal number. If you are interested in the peak performance of SPDK, you can find it at the data point of 6 threads (the same as the number of CPU cores on our evaluation machine). We are also aware of SPDK's poor performance with more threads than CPU cores and discuss it in Sec. 6.2 of our paper (citation 54 also offers an excellent explanation).
We use a queue depth of 1 only in the closed-loop experiments comparing latency, because this setting gives the lowest latency. When comparing maximum throughput, we did use a setting similar to the one you suggest, i.e., increasing the queue depth and submitting in batches (Figure 7). Hope my response answers your question. We're happy to talk more. Thank you!
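For concreteness, here is a minimal sketch of what "increase the queue depth and submit in batches" might look like with liburing; this is not the paper's benchmark code, and `QD`, `BLOCK_SIZE`, and the device path are illustrative assumptions:

```c
/* Hypothetical sketch: batched submission at queue depth QD with liburing.
 * Not the paper's benchmark; QD, BLOCK_SIZE, and the device path are assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QD         32
#define BLOCK_SIZE 512

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    void *bufs[QD];

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
        return 1;

    /* Queue QD reads, then hand them to the kernel with one submit call. */
    for (int i = 0; i < QD; i++) {
        if (posix_memalign(&bufs[i], 4096, BLOCK_SIZE) != 0)
            return 1;
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE, (__u64)i * BLOCK_SIZE);
    }
    io_uring_submit(&ring);

    /* Reap all QD completions. */
    for (int i = 0; i < QD; i++) {
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            return 1;
        if (cqe->res < 0)
            fprintf(stderr, "read %d failed: %d\n", i, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```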
Full disclosure - I'm one of the creators of SPDK and designed many of these APIs, including SPDK's thread multiplexing system. I really am the authority on what is or isn't correct usage of SPDK. And you are doing it very wrong.
This statement is horribly wrong. The problem is not at all that event-based systems like SPDK or io_uring don't work well when you oversubscribe the cores. The problem is that you have failed to understand how to use them correctly. These systems multiplex many events from different sources and tasks onto a single thread per core. BY DESIGN. Using more than one thread per core does not make any sense and can only be considered an invalid configuration. Your paper's discussion of these systems is deeply flawed.
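A rough illustration of that design, assuming one I/O thread pinned per core; the qpair array, `poll_network()`, and `run_app_tasks()` below are hypothetical placeholders, not SPDK APIs:

```c
/* Illustrative per-core reactor loop: one thread multiplexes completions
 * from several NVMe queue pairs plus other event sources and app work.
 * poll_network() and run_app_tasks() are hypothetical placeholders. */
#include <spdk/nvme.h>

#define NUM_QPAIRS 4

void core_reactor(struct spdk_nvme_qpair *qpairs[NUM_QPAIRS],
                  void (*poll_network)(void),
                  void (*run_app_tasks)(void))
{
    for (;;) {
        /* Drain whatever completions are ready on each queue pair (0 = no limit). */
        for (int i = 0; i < NUM_QPAIRS; i++)
            spdk_nvme_qpair_process_completions(qpairs[i], 0);

        /* Poll other event sources and run application tasks on the same thread. */
        poll_network();
        run_app_tasks();
    }
}
```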
Even in the open-loop case, unless you are setting the rate parameter, which isn't mentioned in the paper, it's going to busy poll for 16us after every submission for absolutely no reason. This roughly lines up with the ~100k IOPS per core you're seeing in the benchmark. If you were doing it right, you'd get more than 10 million per core. See the available benchmarks in the SPDK project for how to do it correctly.

Look - I don't dislike the general idea of a BPF resubmit function. I don't think it has much chance of being merged due to the layering violations, but I'm not opposed to the concept and think it has definite advantages for applications written to do synchronous IO from thread pools. But the use of both SPDK and io_uring here is just incorrect, making all of your comparisons invalid.
Hi Ben, We’re glad to be in contact with someone from the SPDK project, and appreciate your feedback.
We agree that SPDK was not originally designed to run with multiple threads per core, and this is exactly what we are showing in the multi-threading experiment. Indeed, our experiment shows that to achieve its peak performance, SPDK should run with a single thread per core. This experiment precisely leads us to the conclusion in the paper: when the thread count is lower than the core count, SPDK offers the best performance, and XRP is closer to it than read() and io_uring; whereas once there is high CPU contention (i.e., multiple threads on a core), XRP scales better than the others. We can certainly envision situations with high CPU contention where a thread with high CPU usage contends with a thread that is using SPDK. For example, this can happen with non-cooperative applications that share the same core. Our multi-threading experiment is a proxy for such a high-CPU-contention scenario -- perhaps not a perfect proxy.
The Y-axis units in Figure 7 are actually the number of BPF-KV operations, not the total IOPS. Given an index depth of 6, one BPF-KV operation equals 7 I/Os (6 index lookups + 1 lookup to fetch the value). In Figure 7, with 6 threads, SPDK achieves 697k BPF-KV ops/s (116k per core), which translates to 4.9M IOPS (813k per core). Given that the max 512B-random-read throughput of the device we used is 5.0M IOPS, SPDK's throughput matches the device IOPS. Regarding the rate parameter, we're not sure what you're referring to. We set the target rate of generating KV requests to the hardware limit.
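For reference, the conversion above works out as follows (rounded as in the figure):

```latex
\[
697\,\mathrm{k\ ops/s} \times 7\ \mathrm{I/Os\ per\ op} \approx 4.9\,\mathrm{M\ IOPS},
\qquad
\frac{697\,\mathrm{k\ ops/s}}{6\ \mathrm{cores}} \approx 116\,\mathrm{k\ ops/s\ per\ core}
\;\Rightarrow\;
116\,\mathrm{k} \times 7 \approx 813\,\mathrm{k\ IOPS\ per\ core}.
\]
```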
I understand your line of thinking now and there are a couple of mistakes still.

First, your paper and your comments here still lead the reader to believe that running SPDK with more than one thread per core is a valid thing to do. It's not. Phrasing such as "not originally designed to run with multiple threads per core", where "originally" implies that the situation may have changed since it was originally written, is misleading.

Second, I understand your focus is on systems where the CPU resources are oversubscribed. Unfortunately, your proxy for a system that's oversubscribed is not only "not a perfect proxy" but not even a reasonable one. What you've done is take a 6-core system and run more than 6 SPDK threads on it that are busy waiting, so they're constantly contending with one another. To state it again, you'd never run more than 1 SPDK thread per core for any reason. What can happen, as you note, is that other applications on the system may steal time away from the SPDK threads. That's not the intended way to use SPDK, but that's something far more reasonable to explore. A benchmark that runs 6 SPDK cores plus another application that simulates doing compute work on maybe 3 to 5 of the cores is what the proxy should have been.

Further, for the benchmark of XRP, you run more than 6 threads that are not busy waiting and instead nicely yield their timeslice to the kernel for interrupts to be processed. Again, what you should have done here was make 6 XRP threads and then have another application simulating compute work on some number of cores. This application would have interfered with the interrupt processing.
I understand the math now, but it's ridiculous to scale up a busy-polling system like SPDK to more cores than it needs to saturate the hardware. SPDK should never be running more than 1 thread on a system only capable of 5M IOPS. You left off lower thread counts from Figure 7(a) - I'd highly recommend you go back and run those. SPDK should saturate that disk using just one thread if your benchmark is working correctly. And if you have a better workload proxy that burns 3-5 of the 6 cores doing other compute, you'd see that the actual optimal way to share the cores is to dedicate one core to SPDK and 5 to the other compute task.
Here's your benchmark's main loop per thread:
You should instead have each thread issue a queue depth's worth of reads, then sit in a while loop calling spdk_nvme_qpair_process_completions(). When each operation completes, it should continue on to the next step from within the completion callback. There should be no sleeps, no rate limits, and no new command submissions within the while loop. It is an event-based system - everything should be driven by events.
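A minimal sketch of that pattern, assuming the controller, namespace, and I/O qpair are already set up; `QUEUE_DEPTH`, `next_lba()`, and the `io_ctx` struct are illustrative assumptions rather than code from SPDK's own benchmarks or the paper:

```c
/* Sketch of the event-driven pattern described above: prime QUEUE_DEPTH
 * reads, then only poll for completions; each completion callback issues
 * the next read. Assumes ns/qpair are already initialized elsewhere. */
#include <spdk/env.h>
#include <spdk/nvme.h>
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_DEPTH 64

struct io_ctx {
    struct spdk_nvme_ns    *ns;
    struct spdk_nvme_qpair *qpair;
    void                   *buf;
};

static void submit_read(struct io_ctx *ctx);

/* Hypothetical LBA selection; a real benchmark would randomize this. */
static uint64_t next_lba(void)
{
    static uint64_t lba;
    return lba++ % (1 << 20);
}

/* Completion callback: check status, then continue to the next step
 * (here, simply submit another read) from within the callback. */
static void read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    struct io_ctx *ctx = arg;

    if (spdk_nvme_cpl_is_error(cpl)) {
        fprintf(stderr, "read failed\n");
        exit(1);
    }
    submit_read(ctx);
}

static void submit_read(struct io_ctx *ctx)
{
    /* 512 B read of one LBA; completion is delivered via read_complete(). */
    if (spdk_nvme_ns_cmd_read(ctx->ns, ctx->qpair, ctx->buf, next_lba(), 1,
                              read_complete, ctx, 0) != 0) {
        fprintf(stderr, "submission failed\n");
        exit(1);
    }
}

/* Per-thread main loop: no sleeps, no rate limiting, and no submissions
 * in the loop body itself; new reads are issued only from callbacks. */
void run_io_loop(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
{
    struct io_ctx ctxs[QUEUE_DEPTH];

    for (int i = 0; i < QUEUE_DEPTH; i++) {
        ctxs[i].ns = ns;
        ctxs[i].qpair = qpair;
        ctxs[i].buf = spdk_zmalloc(512, 512, NULL,
                                   SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
        submit_read(&ctxs[i]);
    }

    for (;;) {
        /* Process whatever completions are ready (0 = no limit). */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }
}
```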
This benchmark has at least two fatal flaws.
This is why you see extremely low performance from SPDK and no better performance with io_uring than a blocking read system call.