Poor multithreaded performance #343
@pikrzysztof You have impeccable timing. @zenhack and I are scheduled to pair up today and dig into performance-tuning for the RPC system. Would you be willing to share your application/benchmark code? In the meantime, can you confirm which code revision you are using?
@lthibault Thank you for the prompt response. I cannot share the original code, but I can try writing a synthetic benchmark, though that may take a while. I was using the latest v2 release, i.e.
I think this was when Kenton hit this on the C++ implementation: https://twitter.com/kentonvarda/status/1553120336018870272
@pikrzysztof I can't recall if this specific issue has evolved between v2 and v3, but I would strongly encourage you to switch to v3. Any improvements are going to happen there. It's currently labeled as alpha, but I'm running it in production and haven't had any major issues. YMMV, of course.
@lthibault the spinlock is still there, please see https://github.com/capnproto/go-capnproto2/blob/0a7a48ce93a8cdf8c932689d71edc812a9e1e617/message.go#L152-L163
So it is. Thanks! We're focusing our initial perf-tuning efforts on memory allocations in the …; a synthetic benchmark would indeed be helpful in the meantime.
That's not a spinlock; the compare-and-swap isn't implementing a lock, it's just retrying if the value was modified from underneath it. We could probably make this a bit better by using atomic.AddInt64 instead of having a loop that retries (and changing the limit to be signed, and checking for negatives). ...but it is quite sad that we even need to do atomic operations for reads in the first place. The more invasive change that I think might make sense: instead of having a single global limit attached to the message, have a limit that is tracked separately and can then just be per-goroutine. Maybe store a pointer to it in types like …
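For concreteness, a minimal sketch of the atomic.AddInt64 idea described above (not the library's actual code; the type, field, and error names are illustrative):

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

var errReadLimitExceeded = errors.New("read limit exceeded")

// readLimiter tracks the remaining read limit as a signed counter.
type readLimiter struct {
	remaining int64
}

// canRead charges sz bytes against the limit with a single atomic add.
// A negative result means the limit has been exhausted (possibly by a
// concurrent reader), so no compare-and-swap retry loop is needed.
func (rl *readLimiter) canRead(sz int64) error {
	if atomic.AddInt64(&rl.remaining, -sz) < 0 {
		return errReadLimitExceeded
	}
	return nil
}
```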
Here's a quick & dirty benchmark that hits the hot path: https://github.com/pikrzysztof/gocapnp/blob/master/main_test.go Please note that this is a benchmark on a laptop CPU; scaling across sockets can be even worse.
You can see that the more goroutines we create, the worse the benchmark score gets, showing ReadLimiter as the main source of our pain.
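The real benchmark lives in the linked repo; a rough sketch of its shape, using the library's books example schema, might look like the following (the import paths and benchmark name are assumptions):

```go
package gocapnp_test

import (
	"testing"

	capnp "capnproto.org/go/capnp/v3"

	"example.com/gocapnp/books" // code generated from the books example schema; path is an assumption
)

// Many goroutines read the same decoded message concurrently, so every
// accessor call goes through the message's shared ReadLimiter.
func BenchmarkParallelTitleRead(b *testing.B) {
	msg, seg, err := capnp.NewMessage(capnp.SingleSegment(nil))
	if err != nil {
		b.Fatal(err)
	}
	book, err := books.NewRootBook(seg)
	if err != nil {
		b.Fatal(err)
	}
	if err := book.SetTitle("War and Peace"); err != nil {
		b.Fatal(err)
	}

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			root, err := books.ReadRootBook(msg)
			if err != nil {
				b.Error(err)
				return
			}
			if _, err := root.Title(); err != nil {
				b.Error(err)
				return
			}
		}
	})
}
```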
@pikrzysztof Is your benchmark repo set to private? (I am not going to run it, but since it 404s out, I figured I'd mention it to see if that was intentional or not.)
@ocdtrekkie thanks for noticing, fixed
You can probably work around this for the moment by creating a separate message with the same backing arena:
...which will then have its own read limit and thus not have to contend with the other threads.
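The snippet itself isn't preserved in this capture. A rough sketch of the idea, assuming the exported Arena field on capnp.Message, might look like the following; as the rest of the thread shows, this naive version still needs care around initialization:

```go
package sketch

import capnp "capnproto.org/go/capnp/v3" // import path is an assumption; v2 users would import zombiezen.com/go/capnproto2

// cloneForGoroutine returns a Message that shares msg's backing arena, so
// reads through the clone are charged against the clone's own read limit
// instead of contending on the original's. Hypothetical helper; as noted
// below, this leaves some of the clone's internal state (its first
// segment) unset.
func cloneForGoroutine(msg *capnp.Message) *capnp.Message {
	return &capnp.Message{Arena: msg.Arena}
}
```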
Yeah, that's more or less what I would expect. Have you tried the workaround I suggested above?
Yes, I did try it, but the code crashes. Please see the code and the error message below:
@zenhack I have investigated the crash and it looks like the first segment in the cloned message is unset
running
allows me to read the cloned message without any problems. AIUI the issue is that …
Ah, that makes sense. Maybe instead try:
...which should properly initialize firstSeg.
Yes, firstSeg is initialized, but does not have the same data as the original message. firstSeg gets initialized to an array of zeroes:
but the original object has the book title stored in there:
which means that the new Book object has an empty title. I don't think focusing on this particular workaround is an efficient way of spending our time; let's focus on fixing the general use case. My application is currently working with a forked capnp which does not have the ReadLimiter in performance-critical sections.
Yeah, probably not worth continuing to fight with it. My current thinking is to add a pointer to the limiter as a field of each relevant datatype (…)
...and then stop using atomic operations on the read limiter, instead documenting that you need to use these methods to get a shallow copy if you're going to access a message from multiple threads. @lthibault, thoughts? Also tangentially related, @zombiezen, I have some questions re: the Message struct:
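A hedged sketch of the limiter-as-a-field idea proposed above (all names, fields, and the copy method are hypothetical, not the library's real layout):

```go
package sketch

// readLimiter tracks how much data one reader may still traverse.
type readLimiter struct {
	remaining int64
}

// Segment is a placeholder for the real segment type.
type Segment struct{}

// Struct is a stand-in for the pointer-ish datatypes under discussion;
// it carries its own limiter pointer, so reads need no atomic operations.
type Struct struct {
	seg     *Segment
	off     uint64
	limiter *readLimiter
}

// WithFreshLimit returns a shallow copy carrying its own limiter; callers
// would use something like this before handing the value to another
// goroutine.
func (s Struct) WithFreshLimit(limit int64) Struct {
	s.limiter = &readLimiter{remaining: limit}
	return s
}
```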
The other thing I am tempted to do, but also think is maybe way too invasive: just take the read limit as a parameter to accessor methods, instead of storing it in the data types at all, e.g.
it'd be somewhat analogous to a …
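Roughly, that alternative would move the limit into the accessor's signature; a simplified, hypothetical illustration (not the generated code's real API):

```go
package sketch

import "errors"

// ReadLimiter is a simplified stand-in for the real limiter.
type ReadLimiter struct{ remaining int64 }

func (rl *ReadLimiter) canRead(n int64) error {
	if rl.remaining < n {
		return errors.New("read limit exceeded")
	}
	rl.remaining -= n
	return nil
}

// Book is a placeholder for a generated struct type.
type Book struct{ title string }

// Title takes the limiter explicitly, so each goroutine can pass its own
// limiter value instead of contending on one stored in the message.
func (b Book) Title(rl *ReadLimiter) (string, error) {
	if err := rl.canRead(int64(len(b.title))); err != nil {
		return "", err
	}
	return b.title, nil
}
```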
My recollection is that it was to reduce allocations when reading a single-segment message: fb3c35a. And I believe 1229c29 and 8dfabb4 have the benchmarks that I was using to justify that.
The intent was to preserve the identity of the segment once it is introduced, so |
I suppose we could add an .Is method or the like.
I don't know that it's super-important to perform the equality, but I definitely didn't want to be allocating a new segment object on the heap every time the method was called.
Yeah, my thinking is that if we can kill the pointer indirection, it can just be passed by value and not allocate.
Sounds reasonable to me 👍
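To illustrate the last exchange: if Segment is a value type, identity can still be expressed with an Is-style method, and copying it does not allocate. The fields below are simplified assumptions, not the library's real layout:

```go
package sketch

// Message is a placeholder for the real message type.
type Message struct{}

// Segment passed by value: accessors can return it without allocating a
// fresh *Segment on the heap on every call.
type Segment struct {
	msg  *Message
	id   uint32
	data []byte
}

// Is reports whether two Segment values identify the same segment.
func (s Segment) Is(other Segment) bool {
	return s.msg == other.msg && s.id == other.id
}
```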
I have an application which reads capnproto structures from the network and feeds them to ~600 goroutines which analyze them (read-only) in parallel. This caused massive lock contention where, according to perf, ReadLimiter's spinlock alone was responsible for 40% of cumulative CPU usage; please see flamegraph.html.gz (compressed to keep GitHub from stripping the embedded JS). Once we applied 1b552dd, the application's CPU usage dropped more than 10x, from 60 cores to <6 cores.
I think this library might benefit from replacing spinlocks with sleeping locks. Spinlocks are generally not the best choice in userspace, where you can't guarantee you keep the CPU while holding the lock. If a goroutine takes the lock and then gets descheduled, other application threads will burn CPU cycles until both the kernel and the Go scheduler agree to give the CPU back to the goroutine holding the lock. Another disadvantage is that CPU usage correlates poorly with the amount of actual work done and can grow without bound.
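For reference, the "sleeping lock" shape suggested here would look roughly like the following (illustrative only; as noted elsewhere in the thread, the current code is an atomic retry loop rather than a true spinlock):

```go
package sketch

import (
	"errors"
	"sync"
)

// readLimiter guarded by a mutex: a goroutine that loses the CPU while
// holding the lock makes waiters sleep instead of burning cycles.
type readLimiter struct {
	mu        sync.Mutex
	remaining uint64
}

func (rl *readLimiter) canRead(sz uint64) error {
	rl.mu.Lock()
	defer rl.mu.Unlock()
	if rl.remaining < sz {
		return errors.New("read limit exceeded")
	}
	rl.remaining -= sz
	return nil
}
```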