mycpp: add 128B pool #1958
base: soil-staging
Conversation
This improves performance on memory-bound workloads.
```
@@ -265,10 +265,12 @@ class MarkSweepHeap {
#ifndef NO_POOL_ALLOC
  // 16,384 / 24 bytes = 682 cells (rounded), 16,368 bytes
  // 16,384 / 48 bytes = 341 cells (rounded), 16,368 bytes
  // 16,384 / 96 bytes = 171 cells (rounded), 16,368 bytes
```
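For reference, a minimal sketch of the sizing arithmetic behind these comments, assuming one 16,384-byte block per pool (the names are illustrative, not the actual mycpp Pool code):

```cpp
#include <cstdio>

// Hypothetical constants mirroring the comments above; not the real
// mycpp Pool implementation.
constexpr int kBlockBytes = 16384;

constexpr int NumCells(int cell_size) {
  return kBlockBytes / cell_size;  // truncating division
}

static_assert(NumCells(24) == 682, "682 * 24 = 16,368 bytes used");
static_assert(NumCells(48) == 341, "341 * 48 = 16,368 bytes used");
static_assert(NumCells(128) == 128, "128 * 128 = 16,384 bytes, no slack");

int main() {
  printf("24B cells:  %d per block\n", NumCells(24));
  printf("48B cells:  %d per block\n", NumCells(48));
  printf("128B cells: %d per block\n", NumCells(128));
}
```

One incidental property of 128-byte cells is that they divide the 16 KiB block exactly, leaving no slack within the block.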
Hm this still says 96
I kinda want to do some evaluation, to make sure it's not over-tuned for Linux / glibc / 64-bit
And does this waste memory on any workloads?
Also, I think we had the theory again that malloc() requests need some slack, i.e. should be strictly less than 16,384 bytes, to allow for allocator headers. But I guess we never measured that effect.
We already run on many different libcs with different malloc() implementations, so yeah, I think we can at least survey Alpine Linux
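That effect could be measured directly with malloc_usable_size(), which glibc and musl (Alpine's libc) both provide. A quick probe might look like this, where the slack values are guesses rather than measured constants:

```cpp
#include <malloc.h>  // malloc_usable_size() on glibc / musl
#include <cstdio>
#include <cstdlib>

int main() {
  // If the allocator needs header space, a request of exactly 16,384
  // bytes may get rounded up past the 16 KiB size class, while a
  // slightly smaller request fits within it.
  size_t requests[] = {16384, 16376, 16368};
  for (size_t r : requests) {
    void* p = malloc(r);
    printf("request %zu -> usable %zu\n", r, malloc_usable_size(p));
    free(p);
  }
}
```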
Oh whoops. That's a typo leftover from a previous version. Will fix.
Happy to leave this open until we have a better idea of the effect with different allocators. I suspect it will be a win in most cases, though, since the write to the allocation header (whether it's below the buffer or elsewhere) is going to cause page faults regardless, and reducing those was the main win here.
OK, if leaving this open isn't blocking anything, it will be a good opportunity to test out my new performance harness. It bugged me that the CI is not good enough to really measure performance; we need other hardware and other systems. I think a lot of projects actually have real hardware labs for this, e.g. the Go language has a bunch of machines all over. But we can probably do pretty well on a small budget :)
BTW I still want to merge this, but I want to use it as an example for some perf evaluation on different machines with different libc allocators.
Oh, this was accidentally closed because I deleted the
OK, it merged cleanly, so I pushed it to CI again.

I was thinking about the pools again, but I didn't remember why at first. Oh yeah, it was because Aidan and I were investigating a word splitting bug, and I was thinking about how to do fewer allocs in word splitting, which is hot. It felt like we needed more 32-byte Slices, like our 32-byte Tokens. So I was wondering if we should make the pools 24/32/48, or 24/48/128, etc. That is, there is still a "hot" part of the code that isn't "done" yet, because there are IFS bugs.

I also felt like this depends a lot on the specific machine, e.g. a 128-byte pool could be more of a de-optimization for a 32-bit machine, although those are getting rarer and rarer.
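As a sketch of the trade-off between those configurations, hypothetical size-class routing for the 24/48/128 option might look like this (toy code, not the actual mycpp allocator):

```cpp
#include <cstdio>
#include <cstdlib>

// Toy stand-in for a fixed-cell-size pool; allocation just forwards to
// malloc() so the example runs, unlike a real free-list pool.
template <int CELL_SIZE>
struct Pool {
  void* Allocate() { return malloc(CELL_SIZE); }
};

Pool<24> pool24;
Pool<48> pool48;
Pool<128> pool128;

// With 24/48/128, a 32-byte Slice or Token lands in the 48-byte pool
// and wastes 16 bytes per object; with 24/32/48 it would fit exactly,
// but anything from 49 to 128 bytes would fall through to malloc().
void* Allocate(size_t n) {
  if (n <= 24) return pool24.Allocate();
  if (n <= 48) return pool48.Allocate();
  if (n <= 128) return pool128.Allocate();
  return malloc(n);
}

int main() {
  void* p = Allocate(32);  // routed to the 48-byte size class
  printf("32-byte request served from the 48B pool: %p\n", p);
  free(p);
}
```

The design question is which costs less overall: internal fragmentation inside a larger cell, or falling through to malloc() for mid-sized objects.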
On the cachegrind benchmarks, this does help a little, on both the parsing and fibonacci workloads:

http://op.oilshell.org/uuu/github-jobs/8247/benchmarks2.wwz/_tmp/gc-cachegrind/index.html
http://op.oilshell.org/uuu/github-jobs/8249/benchmarks2.wwz/_tmp/gc-cachegrind/index.html
I guess one reason I've been hesitant is that we could be using more memory and not notice it on the benchmarks. It is common to trade speed for memory, and then the memory doesn't get measured.

I hope we can do the IFS fixes, and then see where word eval is at. Actually, I checked in some word eval benchmarks the other day: https://oilshell.zulipchat.com/#narrow/channel/121539-oil-dev/topic/word.20split.20benchmarks. We're actually faster than bash! But slower than other shells. Word eval is known to be hot, and speeding it up even sped up CPython configure, as far as I remember.

So I hope we can get that in, and then compare say 24/32/48 vs. 24/48/128. Although we will still have the memory issue then, and it's still true that we could be over-fitting for certain hardware.
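One cheap way to make the memory side visible is to record peak RSS with getrusage(); a minimal sketch, assuming Linux, where ru_maxrss is reported in kilobytes:

```cpp
#include <sys/resource.h>
#include <cstdio>

// Report peak resident set size at exit, so a speed-for-memory trade
// shows up in benchmark output instead of going unnoticed.
int main() {
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) == 0) {
    printf("max RSS: %ld KiB\n", ru.ru_maxrss);  // kilobytes on Linux
  }
  return 0;
}
```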
Also, I could be missing something in terms of the benchmarks, since there are so many dimensions and workloads. I'm sure there are other ways to look at this.
This improves performance on memory-bound workloads and helps reduce page faults in workloads like CPython configure.