NG is slower compared to Bellard version #876
The memory consumption could be explained by the poly IC, and the numbers you showed are what, ms? They look reasonably close to the margin of error. What benchmark did you run? |
The bottom numbers are "sample numbers" - the number of samples the profiler took while the stack pointer was in this particular function. But yeah, it seems to take a sample every 1ms. I ran our build system on itself - the typical benchmark we do when making changes to the language. There's a lot of JS code involved so it's hard to pinpoint the exact place. |
Are both binaries built with the (exact!) same compiler and compiler flags? Even flag order can matter. Does your benchmark contain a lot of code that runs only once? The IC infrastructure is pure overhead for such code at the moment; you pay for bookkeeping but without any payoff. It's on my todo list to introduce an IC preinit mode that delays IC creation until the second or third time the bytecode is executed. |
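As a rough illustration of that idea, here's a minimal sketch in C of gating IC allocation behind a per-function execution counter, so code that runs only once or twice never pays for IC bookkeeping. Names, the threshold, and the InlineCache type are hypothetical, not the actual quickjs-ng code:

```c
/* Hypothetical sketch of an "IC preinit" gate; names, threshold and the
 * InlineCache type are illustrative, not the real quickjs-ng structures. */
#include <stdint.h>
#include <stdlib.h>

#define IC_WARMUP_THRESHOLD 2       /* allocate ICs on the 3rd execution */

typedef struct InlineCache { int slots; } InlineCache;  /* placeholder */

typedef struct FunctionBytecode {
    uint32_t exec_count;            /* how many times this function ran */
    InlineCache *ic;                /* NULL until the function is "warm" */
} FunctionBytecode;

/* Called on function entry: only pay the IC allocation cost once the
 * function has proven that it runs more than a couple of times. */
void maybe_init_ic(FunctionBytecode *b)
{
    if (b->ic == NULL && b->exec_count++ >= IC_WARMUP_THRESHOLD)
        b->ic = calloc(1, sizeof(*b->ic));
}
```

Until the IC exists, property accesses would just take the normal hash-table path, which is exactly what run-once code wants.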
Yes, the only difference was that the new code uses C11 instead of C99; but I excluded that by compiling the old version with C11 too.
Can be, yes. |
IC was introduced quite a while ago, in v0.2.0: #120 There is no way to disable it, currently. |
See #883 and #884. I ran some benchmarks where I, ahem, surgically removed ICs and here are the results for web-tooling-benchmark (>1.0 means ICs are faster, <1.0 means slower):
babel-minify and prettier benefit a great deal but for everything else it's either a minor win or slower; a good deal slower in case of the typescript compiler benchmark. The mean suggests they're still a net win on average, or at least, a net win when running benchmarks 🙈 I don't know if that means keep 'em, chuck 'em, or some third option. We're trading memory for CPU. Which is more important? I also experimented with pre-init ICs. They regress the babel benchmark 5x (which I expected albeit not that dramatically) but otherwise don't move the needle much, which makes sense because they're benchmarks, they don't contain much code that only runs once. |
I don't think I'm qualified enough to decide, so I'll ask some questions :-)
|
@saghul I have a feeling the benchmarks give different results based on what arch you are on, arm64 or x86. Before quickjs-ng came into existence I ran some benchmarks with IC/proto IC and noticed it made performance a lot better. But on my arm64 M4 MacBook today, the difference shrinks. So I think it depends on where you are running quickjs? On an average system, IC will make a big difference compared to no IC. I can share benchmarks of different qjs variants here if that helps. |
Good point about architecture! |
Hello, it looks like we can remove it; maybe the inline cache can be made optional with a compile-time option? Best regards |
For the record: the benchmark numbers are from an i7 Intel system.
V8/SM/JSC:
For property accesses with a degree of polymorphism > 4, we scan the IC cache and perform a hash table lookup in case of a cache miss. Megamorphic ICs are an end state that says "too much polymorphism, so don't bother and do the hash table lookup right away." They're a potential performance trap though, because it's not uncommon for long-lived code to have lots of polymorphism at first, then stabilize when the program reaches steady state. I think most optimizing JS engines have second-chance strategies to mitigate that, but fine-tuning that is finicky Goldilocks stuff: don't want to flush too often, don't want to wait too long, has to be just right.
At the moment the only thing you could tweak is the number of cache slots, currently 4. Lower that to 3 and the size of an IC on 64-bit systems drops from 56 to 44 bytes, and to 32 bytes when lowered to 2. I just tried with N=3. Most benchmarks are unaffected but babel-minify and typescript slow down by 40% and 46% respectively.
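For readers unfamiliar with the structure being described, here's a minimal sketch in C of the lookup strategy: scan a handful of per-site slots, fall back to the hash table on a miss, and give up (go megamorphic) once the site has seen too many shapes. Field names and layout are illustrative, not the actual quickjs-ng definitions; the point is that every extra slot adds to the per-IC memory cost quoted above.

```c
/* Illustrative polymorphic-IC lookup; layout does not match quickjs-ng
 * exactly. */
#include <stdint.h>

#define IC_SLOTS       4            /* the tunable discussed above */
#define IC_MEGAMORPHIC UINT32_MAX   /* "stop caching, always go slow" */

typedef struct Shape Shape;         /* hidden class / shape, opaque here */

typedef struct InlineCache {
    uint32_t count;                 /* valid slots, or IC_MEGAMORPHIC */
    struct {
        Shape   *shape;             /* shape this entry was recorded for */
        uint32_t offset;            /* property slot offset for that shape */
    } slots[IC_SLOTS];
} InlineCache;

/* Returns 1 on a fast-path hit; 0 means the caller does the regular hash
 * table lookup and then either records a new slot or marks the site
 * megamorphic once all IC_SLOTS entries are already used. */
int ic_lookup(InlineCache *ic, Shape *shape, uint32_t *offset)
{
    if (ic->count == IC_MEGAMORPHIC)
        return 0;                   /* too polymorphic: skip the scan */
    for (uint32_t i = 0; i < ic->count; i++) {
        if (ic->slots[i].shape == shape) {
            *offset = ic->slots[i].offset;
            return 1;
        }
    }
    return 0;                       /* miss: take the slow path */
}
```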
Yes, but no low hanging fruit. This comment is already way too long so I'll save that for another time :p |
I ran Valgrind with #884 applied. It does solve the memory issue, but the instruction count is only 1% lower; apparently there's something else.
One thing I want to try is re-inlining functions - the old version de-inlined them for some reason (probably to avoid using C11); I don't think this will change anything, but it's worth trying to minimise the difference. |
I checked and it does, but in the wrong direction. No benchmarks improve but some regress. |
bellard/master misses a couple of correctness fixes; it doesn't record the PC ( |
@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again. The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction. WDYT? |
A couple of ifs or a single assignment should not have this much of an effect. That's why I'm sceptical about the inlining I mentioned earlier, but I want to exclude it, just in case. This looks like extra (re)allocations or a greedy algorithm, in my experience with the profiler. |
They do when they're part of a bytecode handler. Some bytecodes execute millions or billions of times over the lifetime of a program. |
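To make that concrete, here's a tiny stack-machine sketch (hypothetical opcodes, not the quickjs bytecode set): anything added inside a `case` body of the dispatch loop is paid on every single execution of that opcode, so "just a couple of ifs" multiplies by the instruction count.

```c
/* Minimal interpreter-loop sketch with made-up opcodes; any branch or
 * assignment inside a case body runs once per executed instruction,
 * which can be millions or billions of times per program. */
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH_1, OP_ADD, OP_RET };

static int64_t run(const uint8_t *pc)
{
    int64_t stack[64], *sp = stack;
    for (;;) {
        switch (*pc++) {
        case OP_PUSH_1:
            *sp++ = 1;          /* an "extra if" here is paid per push */
            break;
        case OP_ADD:
            sp[-2] += sp[-1];
            sp--;
            break;
        case OP_RET:
            return sp[-1];
        }
    }
}

int main(void)
{
    const uint8_t code[] = { OP_PUSH_1, OP_PUSH_1, OP_ADD, OP_RET };
    printf("%lld\n", (long long)run(code));  /* prints 2 */
    return 0;
}
```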
👍 let's do it. |
I agree. There are some ideas for a simpler yet more effective direction worth investigating. |
Ran some benchmarks with different variants of quickjs:
./quickjs-bellard/qjs --stack-size 99999999 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 0.96 runs/s
babel: 2.78 runs/s
babel-minify: 2.82 runs/s
babylon: 2.03 runs/s
buble: 3.13 runs/s
chai: 3.95 runs/s
espree: 0.63 runs/s
esprima: 1.25 runs/s
jshint: 2.24 runs/s
lebab: 1.96 runs/s
postcss: 1.62 runs/s
prepack: 3.08 runs/s
prettier: 1.26 runs/s
source-map: 1.49 runs/s
terser: 6.37 runs/s
typescript: 4.48 runs/s
uglify-js: 1.71 runs/s
-------------------------------------
Geometric mean: 2.10 runs/s
./quickjs-ng-main/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 0.97 runs/s
babel: 2.80 runs/s
babel-minify: 2.79 runs/s
babylon: 1.95 runs/s
buble: 3.13 runs/s
chai: 3.85 runs/s
espree: 0.63 runs/s
esprima: 1.33 runs/s
jshint: 2.24 runs/s
lebab: 2.05 runs/s
postcss: 1.29 runs/s
prepack: 2.92 runs/s
prettier: 1.26 runs/s
source-map: 1.42 runs/s
terser: 6.25 runs/s
typescript: 4.33 runs/s
uglify-js: 1.20 runs/s
-------------------------------------
Geometric mean: 2.02 runs/s
./quickjs-ng-main-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 1.26 runs/s
babel: 3.61 runs/s
babel-minify: 4.46 runs/s
babylon: 2.46 runs/s
buble: 4.46 runs/s
chai: 6.80 runs/s
espree: 0.81 runs/s
esprima: 2.07 runs/s
jshint: 4.22 runs/s
lebab: 3.64 runs/s
postcss: 1.97 runs/s
prepack: 4.13 runs/s
prettier: 2.10 runs/s
source-map: 2.28 runs/s
terser: 10.53 runs/s
typescript: 5.26 runs/s
uglify-js: 3.02 runs/s
-------------------------------------
Geometric mean: 3.12 runs/s
./quickjs-ng-prototype-inline-cache/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 0.94 runs/s
babel: 2.70 runs/s
babel-minify: 2.84 runs/s
babylon: 1.95 runs/s
buble: 2.86 runs/s
chai: 3.76 runs/s
espree: 0.63 runs/s
esprima: 1.27 runs/s
jshint: 2.11 runs/s
lebab: 1.84 runs/s
postcss: 1.64 runs/s
prepack: 2.96 runs/s
prettier: 1.22 runs/s
source-map: 1.40 runs/s
terser: 5.85 runs/s
typescript: 4.12 runs/s
uglify-js: 1.14 runs/s
-------------------------------------
Geometric mean: 1.97 runs/s
./quickjs-ng-prototype-inline-cache-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 1.23 runs/s
babel: 3.48 runs/s
babel-minify: 4.31 runs/s
babylon: 2.45 runs/s
buble: 4.05 runs/s
chai: 6.76 runs/s
espree: 0.77 runs/s
esprima: 1.98 runs/s
jshint: 4.00 runs/s
lebab: 3.47 runs/s
postcss: 2.37 runs/s
prepack: 4.07 runs/s
prettier: 2.04 runs/s
source-map: 2.24 runs/s
terser: 9.62 runs/s
typescript: 4.90 runs/s
uglify-js: 2.70 runs/s
-------------------------------------
Geometric mean: 3.02 runs/s
./quickjs-preinit-ic/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 0.96 runs/s
babel: 2.78 runs/s
babel-minify: 2.69 runs/s
babylon: 1.93 runs/s
buble: 3.03 runs/s
chai: 3.75 runs/s
espree: 0.62 runs/s
esprima: 1.31 runs/s
jshint: 2.27 runs/s
lebab: 1.97 runs/s
postcss: 1.29 runs/s
prepack: 2.94 runs/s
prettier: 1.26 runs/s
source-map: 1.41 runs/s
terser: 6.41 runs/s
typescript: 4.35 runs/s
uglify-js: 1.22 runs/s
-------------------------------------
Geometric mean: 2.00 runs/s
./quickjs-preinit-ic-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 1.26 runs/s
babel: 3.60 runs/s
babel-minify: 4.42 runs/s
babylon: 2.41 runs/s
buble: 4.42 runs/s
chai: 6.71 runs/s
espree: 0.80 runs/s
esprima: 2.05 runs/s
jshint: 4.27 runs/s
lebab: 3.66 runs/s
postcss: 1.93 runs/s
prepack: 4.12 runs/s
prettier: 2.12 runs/s
source-map: 2.26 runs/s
terser: 10.42 runs/s
typescript: 5.21 runs/s
uglify-js: 2.99 runs/s
-------------------------------------
Geometric mean: 3.10 runs/s
./quickjs-rm-ic/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 0.91 runs/s
babel: 2.76 runs/s
babel-minify: 2.76 runs/s
babylon: 1.90 runs/s
buble: 3.16 runs/s
chai: 3.77 runs/s
espree: 0.62 runs/s
esprima: 1.28 runs/s
jshint: 2.29 runs/s
lebab: 1.95 runs/s
postcss: 1.72 runs/s
prepack: 3.00 runs/s
prettier: 1.24 runs/s
source-map: 1.41 runs/s
terser: 6.06 runs/s
typescript: 2.41 runs/s
uglify-js: 1.65 runs/s
-------------------------------------
Geometric mean: 1.99 runs/s
./quickjs-rm-ic-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
acorn: 1.23 runs/s
babel: 3.55 runs/s
babel-minify: 4.44 runs/s
babylon: 2.40 runs/s
buble: 4.48 runs/s
chai: 6.76 runs/s
espree: 0.78 runs/s
esprima: 1.98 runs/s
jshint: 4.27 runs/s
lebab: 3.64 runs/s
postcss: 2.47 runs/s
prepack: 4.20 runs/s
prettier: 2.11 runs/s
source-map: 2.22 runs/s
terser: 10.53 runs/s
typescript: 3.50 runs/s
uglify-js: 3.01 runs/s
-------------------------------------
Geometric mean: 3.06 runs/s |
Some more benchmarks:
|
I have also noticed that quickjs-ng performance is slower even when compiled with the same toolchain on the same platform. Since some of the comparison engines only support ES5, I only use the v8-v7 test suite: https://github.com/ahaoboy/js-engine-benchmark. Both tjs and llrt use quickjs-ng, but their benchmark performance is much better; I don't know why.
|
@ahaoboy can you try running the benchmark with removed ics PR? |
When poly_ic was added to quickjs-ng, it was indeed much faster than it is today:
./quickjs-ng-poly_ic/build/qjs ./combined.js
Box2D: 6911
CodeLoad: 21546
PdfJS: 5847
Typescript: 21079
Mandreel: 1972
MandreelLatency: 14662
Richards: 1797
DeltaBlue: 1299
Crypto: 2127
RayTrace: 1129
EarleyBoyer: 2286
RegExp: 402
Splay: 1962
SplayLatency: 7099
NavierStokes: 4541
----
Score (version 9): 3541
WITH MIMALLOC
env DYLD_INSERT_LIBRARIES=/opt/homebrew/lib/libmimalloc.dylib ./quickjs-ng-poly_ic/build/qjs ./combined.js
Box2D: 7989
CodeLoad: 33395
PdfJS: 7377
Typescript: 26486
Mandreel: 2003
MandreelLatency: 14926
Richards: 1822
DeltaBlue: 1793
Crypto: 2149
RayTrace: 2440
EarleyBoyer: 3968
RegExp: 538
Splay: 3932
SplayLatency: 13325
NavierStokes: 4515
----
Score (version 9): 4735
So I think the performance regression is probably not due to adding poly_ic to quickjs-ng but instead to some other change that was done along the way? |
I tested it several times, and the scores fluctuated a bit, but the ranking remained basically unchanged. v8-v7 is an ES5 benchmark, but most modern code may use ES6; maybe the test results for ES6 would be different?
|
To make a statistically correct comparison that takes uncertainties into account, it is better to collect 10-20 results per binary and feed the results to |
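As a minimal sketch of the idea (plain C with made-up timing numbers; the tool referenced above, whose name is cut off, automates the real statistics), comparing each binary's mean against its run-to-run spread is what tells you whether a few percent is signal or noise:

```c
/* Illustrative only: compute mean and sample standard deviation of
 * repeated benchmark runs; the timing numbers below are invented. */
#include <math.h>
#include <stdio.h>

static void stats(const double *x, int n, double *mean, double *sd)
{
    double s = 0.0, ss = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    *mean = s / n;
    for (int i = 0; i < n; i++) ss += (x[i] - *mean) * (x[i] - *mean);
    *sd = sqrt(ss / (n - 1));   /* sample standard deviation */
}

int main(void)
{
    double old_runs[] = { 9.8, 9.9, 10.1, 10.0, 9.9 };   /* fake data */
    double new_runs[] = { 10.2, 10.4, 10.3, 10.1, 10.3 };/* fake data */
    double m1, s1, m2, s2;
    stats(old_runs, 5, &m1, &s1);
    stats(new_runs, 5, &m2, &s2);
    printf("old: %.2f +/- %.2f s\nnew: %.2f +/- %.2f s\n", m1, s1, m2, s2);
    return 0;
}
```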
Inline caches are supposed to improve performance but benchmarks show it's currently a mixed bag: sometimes faster, sometimes slower, always more memory hungry. After due consideration, I think it's better to remove the current implementation and start afresh. Refs: #876
@xeioex I ran it many times, the difference is consistently there. |
Valgrind results on top of 6b78c7f (IC added)
Valgrind results on top of 5c3077e (commit before IC added). new/old instructions = 0.9932, so indeed IC added some overhead, but it's not that big
This proves that the instruction count regression was added later. |
Can it be bace4f6? |
What benchmark or benchmarks did you use? I can see that commit slowing down parsing but not runtime performance. A suite like web-tooling-benchmark should be largely unaffected. |
Adding column number bytecodes might affect the peephole optimizer... |
Yes, we evaluate small pieces of JS code and parse them all the time.
I don't think it's that. There's no separate opcode for column numbers. I replaced OP_line_num with an opcode that stores both line and column. resolve_labels and code_match filter those out, same as before.
Right, I can see how that would regress performance because the parser has to do more work now. I consider that a correctness fix though so don't expect it to be rolled back. However, maybe it's possible to make it faster. |
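For illustration only, here's a hedged sketch of the kind of change being described: a single debug opcode carrying both line and column as operands, which bytecode-rewriting passes simply skip over. The opcode name, operand widths, and helper names are hypothetical and not the actual quickjs-ng implementation.

```c
/* Hypothetical line+column debug opcode; not the real quickjs-ng encoding. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { OP_source_loc = 0xF0 };      /* made-up opcode value */

/* Emit: 1 opcode byte followed by two 32-bit operands in host byte order. */
static size_t emit_source_loc(uint8_t *bc, uint32_t line, uint32_t col)
{
    bc[0] = OP_source_loc;
    memcpy(bc + 1, &line, sizeof(line));
    memcpy(bc + 5, &col, sizeof(col));
    return 9;
}

/* A rewriting pass skips the debug markers so they never break instruction
 * matching, analogous to how resolve_labels/code_match filter them out. */
static size_t skip_source_loc(const uint8_t *bc, size_t pos)
{
    while (bc[pos] == OP_source_loc)
        pos += 9;
    return pos;
}

int main(void)
{
    uint8_t bc[16] = {0};
    size_t len = emit_source_loc(bc, 42, 7);   /* line 42, column 7 */
    printf("debug op is %zu bytes; next real op at offset %zu\n",
           len, skip_source_loc(bc, 0));
    return 0;
}
```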
I double-checked and yes, bace4f6 introduces a regression. Can you please speed it up a bit?
Aren't we talking ~1% perf difference here? |
9761057471 / 9396116795 = 1.0388, so closer to 4%. But yeah, having columns is a nice feature, and if it's hard or impossible to speed up, we will have to live with that slowdown. |
So, I ran Valgrind with 2 engines and it seems that NG is about 3% slower and also consumes more memory
Results seem to be consistent between different runs.
However, attempting to use a sampling profiler on macOS does not provide consistent results, making it hard to track down the reason for the degradation.
Some suspects are (the numbers are sample counts in 3 different runs):
Old
js_create_function 32/28/42
__JS_EvalInternal 87/73/81
New
js_create_function 39/36/39
__JS_EvalInternal 105/84/83
It seems that both functions are a bit slower, but that could be natural variance between runs.
Bisecting this would be harder this time, but if there's no other way, I can try doing so.
In the meantime, maybe you have some hints about what could cause the regression?