Prototype: New epoch algorithm #963
Thanks for suggesting an interesting approach! I ran the built-in benchmarks in crossbeam-epoch on my ARM machine and got the following results.
As we can see, I think |
Hi! I've come up with a somewhat different epoch algorithm, which performs very similarly to the current one while being much simpler. (It also fixes #551 and might help with #869.) It might need some performance tuning on Linux, Windows, or weakly ordered architectures, but I'm curious to know what you think of the approach, or if you have any ideas to make it faster.
Unlike the current algorithm, it uses a fixed number of "pinned" indicators instead of one per thread. Each additional indicator helps less against contention as their number grows, especially once there are more of them than cores. (An interesting experiment would be to pick one based on sched_getcpu(). I didn't try this because my system doesn't support it.)
Also unlike the current algorithm, it uses the ordering of epochs to ensure that garbage can't be simultaneously added and removed for the same epoch. This greatly simplifies storing the garbage, because these operations then don't have to be thread-safe with each other.
Finally, it doesn't use any memory ordering stronger than acquire or release. In my opinion this makes it easier to reason about. (It might help performance on ARM, but I don't have one to test it on.)
Internally it uses an approach similar to a RwLock, with reference counters that store the write reference in the high bit and the read references in the low bits. Here's how it works in detail:
Steps
To pin a thread
Once the local buffer of deferred functions is full enough
To advance the epoch (while pinned)
Reference counter
The reference counter is divided into 16 shards.
To read-lock, pick a shard and acquire-increment. If the high bit is set, fail, and if the next-to-high bit is set, panic (this indicates an overflow). To read-unlock, release-decrement the same shard.
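The read side described above could be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the names `ShardedCounter`, `SHARD_COUNT`, `try_read_lock`, and `read_unlock` are hypothetical, and it assumes that a reader which fails leaves its increment in place (relying on write-unlock resetting the shard to 0), which the description does not state explicitly.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of shards, as stated in the description.
pub const SHARD_COUNT: usize = 16;

/// Hypothetical sharded reference counter (names are illustrative).
pub struct ShardedCounter {
    pub shards: [AtomicUsize; SHARD_COUNT],
}

impl ShardedCounter {
    /// High bit: set while a writer holds (or is trying to take) the shard.
    pub const HIGH_BIT: usize = 1 << (usize::BITS - 1);
    /// Next-to-high bit: reaching it means the read count has overflowed.
    pub const OVERFLOW_BIT: usize = Self::HIGH_BIT >> 1;

    pub fn new() -> Self {
        Self { shards: std::array::from_fn(|_| AtomicUsize::new(0)) }
    }

    /// Acquire-increment the chosen shard. Returns false if a writer's bit
    /// was already set. Assumption: the failed increment is not undone.
    pub fn try_read_lock(&self, idx: usize) -> bool {
        let prev = self.shards[idx % SHARD_COUNT].fetch_add(1, Ordering::Acquire);
        assert!((prev & Self::OVERFLOW_BIT) == 0, "read count overflow");
        (prev & Self::HIGH_BIT) == 0
    }

    /// Release-decrement the same shard that was read-locked.
    pub fn read_unlock(&self, idx: usize) {
        self.shards[idx % SHARD_COUNT].fetch_sub(1, Ordering::Release);
    }
}
```

A caller would keep the shard index it picked and pass the same index to `read_unlock`, and would call `read_unlock` only after a successful `try_read_lock`.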
To write-lock, attempt to acquire-CAS each counter from 0 to 0 plus the high bit. If the original value wasn't 0 or HIGH_BIT, fail. If the final counter's original value wasn't 0, fail. This allows writers that failed after setting some counters not to cause deadlocks, while the final counter decides which writer wins. To write-unlock, release-set all counters to 0.
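The write side could look something like the sketch below. Again this is only an illustration under assumptions: `try_write_lock` and `write_unlock` are hypothetical free functions over a slice of shards, and the choice of acquire ordering on the failed CAS is mine, made so that a shard already marked by an earlier failed writer is still read with acquire semantics.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// High bit of each shard: marks the write reference.
pub const HIGH_BIT: usize = 1 << (usize::BITS - 1);

/// Try to write-lock every shard in order. Returns false if any shard has
/// readers, or if another writer already owns the final shard.
pub fn try_write_lock(shards: &[AtomicUsize]) -> bool {
    for (i, shard) in shards.iter().enumerate() {
        // CAS 0 -> HIGH_BIT; either way we learn the original value.
        let prev = match shard.compare_exchange(
            0,
            HIGH_BIT,
            Ordering::Acquire,
            Ordering::Acquire, // assumption: acquire on failure as well
        ) {
            Ok(prev) | Err(prev) => prev,
        };
        // Anything other than 0 or HIGH_BIT means readers are present.
        if prev != 0 && prev != HIGH_BIT {
            return false;
        }
        // The final shard decides the winner: it must have been 0.
        if i + 1 == shards.len() && prev != 0 {
            return false;
        }
    }
    true
}

/// Release-set every shard back to 0.
pub fn write_unlock(shards: &[AtomicUsize]) {
    for shard in shards {
        shard.store(0, Ordering::Release);
    }
}
```

Note how a writer that bails out partway leaves HIGH_BIT on the earlier shards. A later writer treats those as acceptable and keeps going, so the leftover bits block new readers without blocking the next write attempt.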
Proof sketch
As with the classic epoch algorithm, each epoch overlaps the one before and after it (which is required for wait-freedom), but everything in epoch n happens-before everything in epoch n+2. This is because the advancing thread in epoch n+1 write-locks (acquire) epoch n then write-unlocks (release) epoch n+2. A thread in n+2 must advance the epoch to n+3, so n+3 happens-after n, and so on. Thus the latest epoch that could have observed pointers that epoch n unlinked from the data structure is n+1.
Since the advancing thread in n+2 write-locks n+1, it happens-after n+1 as well, and thus happens-after any uses of those pointers, so they are safe to delete. In addition, the advancing thread in n+1 write-locked n, so n is already write-locked from the point of view of the advancing thread in n+2, and no one will touch epoch n's garbage pile until it's unlocked. (I can draw a diagram if it helps.)