NG is slower compared to Bellard version #876

Open
ABBAPOH opened this issue Feb 2, 2025 · 38 comments

Comments

@ABBAPOH
Contributor

ABBAPOH commented Feb 2, 2025

So, I ran Valgrind with the two engines and it seems that NG is about 3% slower and also consumes more memory:

========== Performance data for Resolving ==========
    Old instruction count: 9396116795
    New instruction count: 9761057471
    Relative change: +3 %
    Old peak memory usage: 118112672 Bytes
    New peak memory usage: 120718720 Bytes
    Relative change: +2 %

Results seem to be consistent between different runs.

However, attempting to use a sampling profiler on macOS does not provide consistent results, making it hard to track down the reason for the degradation.
Some suspects are (the numbers are sample counts in 3 different runs):
Old
js_create_function 32/28/42
__JS_EvalInternal 87/73/81

New
js_create_function 39/36/39
__JS_EvalInternal 105/84/83

It seems that both functions are a bit slower, but that could be natural variance between runs.
Bisecting this would be harder this time, but if there's no other way, I can try doing so.

In the meantime, maybe you have any hints what could cause regression?

@saghul
Contributor

saghul commented Feb 2, 2025

The memory consumption could be explained by the poly IC, and the numbers you showed are what, ms? They look reasonably close to the margin of error?

What benchmark did you run?

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 2, 2025

The bottom numbers are "sample counts" - the number of samples the profiler took while the stack pointer was in that particular function. But yeah, it seems to take a sample every 1 ms.
It does seem to be random; that's why I am not sure this is related to eval/js_create_function.
OTOH the overall instruction count is way bigger according to Valgrind - even 1% is concerning.

I ran our build system on itself - the typical benchmark we do when making changes to the language. There's a lot of JS code involved, so it's hard to pinpoint the exact place.

@bnoordhuis
Contributor

Are both binaries built with the (exact!) same compiler and compiler flags? Even flag order can matter.

Does your benchmark contain a lot of code that runs only once? The IC infrastructure is pure overhead for such code at the moment; you pay for the bookkeeping without any payoff. It's on my todo list to introduce an IC preinit mode that delays IC creation until the second or third time the bytecode is executed.
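
Roughly, the idea looks like this (a minimal sketch; all names here are hypothetical, not the quickjs-ng API): the cache for a property-access site is only materialised once that site has executed a couple of times, so run-once code pays nothing beyond a counter increment.

```c
/* Hypothetical sketch of a delayed ("preinit") inline cache.
 * Not the quickjs-ng implementation; names are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define IC_WARMUP_THRESHOLD 2           /* create the IC on the 3rd execution */

typedef struct { uint32_t cached; } InlineCache;   /* placeholder cache */

typedef struct {
    uint8_t hit_count;                  /* executions seen while still cold */
    InlineCache *ic;                    /* NULL until the site is warm */
} ICSite;

static uint32_t slow_lookup(void) { return 7; }    /* stand-in hash-table lookup */

static uint32_t get_property(ICSite *site)
{
    if (!site->ic) {
        if (site->hit_count < IC_WARMUP_THRESHOLD) {
            site->hit_count++;          /* cold path: no IC bookkeeping at all */
            return slow_lookup();
        }
        site->ic = calloc(1, sizeof(*site->ic));   /* warm: materialise the IC */
        if (!site->ic)
            return slow_lookup();
        site->ic->cached = slow_lookup();
    }
    return site->ic->cached;            /* fast path once warmed up */
}

int main(void)
{
    ICSite site = {0};
    for (int i = 0; i < 5; i++)         /* the first two runs stay on the slow path */
        printf("run %d -> %u (IC %s)\n", i, get_property(&site),
               site.ic ? "created" : "not yet created");
    free(site.ic);
    return 0;
}
```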

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 3, 2025

Are both binaries built with the (exact!) same compiler and compiler flags? Even flag order can matter.

Yes, the only difference was that the new code uses C11 instead of C99, but I excluded that by compiling the old version with C11 too.

Does your benchmark contain a lot of code that runs only once?

Can be, yes.
Can you please give the SHA of the commit that introduced the IC stuff so I can try benchmarking before it was introduced? Otherwise I will do a full bisect search.
Or can I disable the feature somehow?

@saghul
Contributor

saghul commented Feb 3, 2025

IC was introduced quite a while ago, in v0.2.0: #120

There is no way to disable it, currently.

@bnoordhuis
Contributor

bnoordhuis commented Feb 3, 2025

See #883 and #884. I ran some benchmarks where I, ahem, surgically removed ICs and here are the results for web-tooling-benchmark (>1.0 means ICs are faster, <1.0 means slower):

       acorn: 1.024
       babel: 0.9386
babel-minify: 1.419
     babylon: 0.9375
       buble: 0.9521
        chai: 0.9507
      espree: 1.033
     esprima: 1.055
      jshint: 0.9683
       lebab: 0.9649
     postcss: 0.9714
     prepack: 0.9403
    prettier: 1.543
  source-map: 1.061
      terser: 1.0
  typescript: 0.8935
   uglify-js: 0.9759
=====================
        mean: 1.0367

babel-minify and prettier benefit a great deal, but for everything else it's either a minor win or slower; a good deal slower in the case of the typescript compiler benchmark.

The mean suggests they're still a net win on average, or at least, a net win when running benchmarks 🙈

I don't know if that means keep 'em, chuck 'em, or some third option. We're trading memory for CPU. Which is more important?

I also experimented with pre-init ICs. They regress the babel benchmark 5x (which I expected, albeit not that dramatically) but otherwise don't move the needle much, which makes sense because these are benchmarks; they don't contain much code that only runs once.

@saghul
Contributor

saghul commented Feb 4, 2025

I don't think I'm qualified enough to decide, so I'll ask some questions :-)

  • Is there something left on the table that would make the IC work "shine" more?
  • Does Add prototype inline cache #551 move the needle in any direction?
  • Intuitively I guess we want the best of both (of course): CPU and memory usage
    • CPU in cases like lambda functions
    • Memory in constrained environments like IoT?
  • IIRC all "big" engines have this. What does it mean? Do they rely on JIT to make the performance leap?
  • Would it be hard to have a way to tune it, so those constrained environments could tweak it to their needs? Does that even make sense?

@ammarahm-ed
Contributor

ammarahm-ed commented Feb 4, 2025

@saghul I have a feeling the benchmarks give different results based on what arch you are on, arm64 or x86. Before quickjs-ng came into existence I ran some benchmarks with IC/proto IC and noticed it made performance a lot better. But on my arm64 M4 MacBook today, the difference shrinks. So I think it depends on where you are running quickjs?

On an average system, IC will make a big difference compared to No IC.

I can share benchmarks of different qjs variants here if that helps.

@saghul
Contributor

saghul commented Feb 4, 2025

Good point about architecture!

@gschwind
Contributor

gschwind commented Feb 4, 2025

Hello,

It looks like we can remove it; maybe the inline cache could be made optional with a compile-time option?

Best regards

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 4, 2025

See #883 and #884

Thanks, I'll try these!

In the meantime, I tried applying our changes on top of #120, but the recent version of the changes gives too many conflicts, so I'd need to try a different approach.

@saghul
Contributor

saghul commented Feb 4, 2025

In the meantime, I tried applying our changes on top of #120, but the recent version of the changes gives too many conflicts, so I'd need to try a different approach.

Applying #884 would be equivalent to the state of the repo pre-#120, but with all current changes.

@bnoordhuis
Contributor

For the record: the benchmark numbers are from an i7 Intel system.

IIRC all "big" engines have this. What does it mean? Do they rely on JIT to make the performance leap?

V8/SM/JSC:

  1. use ICs to collect type feedback for their optimizing tiers; that's more important to them than optimizing property access at the lowest tier because hot code doesn't stay at that tier long

  2. have {mono,poly,mega}morphic ICs whereas we only have mono/poly

For property accesses with a degree of polymorphism > 4, we scan the IC cache and perform a hash table lookup in case of a cache miss.

Megamorphic ICs are an end state that says "too much polymorphism so don't bother and do the hash table lookup right away."

They're a potential performance trap though, because it's not uncommon for long-living code to have lots of polymorphism at first, then stabilize once the program reaches a steady state.

I think most optimizing JS engines have second-chance strategies to mitigate that but fine-tuning that is finicky Goldilocks stuff: don't want to flush too often, don't want to wait too long, has to be just right.

Would it be hard to have a way to tune it, so those contrained environments could tweak it to their needs?

At the moment the only thing you could tweak is the number of cache slots, currently 4. Lower that to 3 and the size of an IC on 64-bit systems drops from 56 to 44 bytes, 32 bytes when lowered to 2.

I just tried with N=3. Most benchmarks are unaffected but babel-minify and typescript slow down by 40% and 46% respectively.
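
For illustration, a rough sketch of this kind of fixed-slot polymorphic IC with the hash-table fallback (names and layout are made up, not the actual quickjs-ng definitions; each slot costs roughly a shape pointer plus a 32-bit offset, in line with the figures above):

```c
/* Rough sketch of a fixed-slot polymorphic inline cache with a
 * hash-table fallback on a miss. Names and layout are illustrative,
 * not the actual quickjs-ng definitions. */
#include <stdint.h>
#include <stdio.h>

#define IC_SLOT_COUNT 4                 /* the tunable discussed above */

typedef struct { int id; } Shape;       /* stand-in for an object shape */

typedef struct {
    uint32_t prop_atom;                 /* property name for this site */
    uint32_t used;                      /* filled slots, <= IC_SLOT_COUNT */
    Shape *shape[IC_SLOT_COUNT];        /* cached shape per slot */
    uint32_t offset[IC_SLOT_COUNT];     /* property offset for that shape */
} PolyIC;

/* Stand-in for the generic lookup; in the real engine this walks a hash table. */
static uint32_t hash_table_lookup(const Shape *s, uint32_t atom)
{
    return (uint32_t)s->id + atom;      /* dummy result for the sketch */
}

/* Linear scan of the slots; fill a free slot on a miss. Once every slot is
 * taken, new shapes always fall through to the hash table - the point where
 * a megamorphic tier would normally take over. */
static uint32_t ic_get(PolyIC *ic, Shape *s)
{
    for (uint32_t i = 0; i < ic->used; i++)
        if (ic->shape[i] == s)
            return ic->offset[i];       /* cache hit */
    uint32_t off = hash_table_lookup(s, ic->prop_atom);
    if (ic->used < IC_SLOT_COUNT) {     /* remember this shape if there is room */
        ic->shape[ic->used] = s;
        ic->offset[ic->used] = off;
        ic->used++;
    }
    return off;
}

int main(void)
{
    PolyIC ic = { .prop_atom = 42 };
    Shape a = { 1 }, b = { 2 };
    printf("%u %u %u\n", (unsigned)ic_get(&ic, &a), (unsigned)ic_get(&ic, &b),
           (unsigned)ic_get(&ic, &a));
    return 0;
}
```

Lowering IC_SLOT_COUNT saves memory per cache but lets more sites overflow into the hash-table path, which is presumably what the babel-minify and typescript slowdowns at N=3 reflect.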

Is there something left on the table that would make the IC work "shine" more?

Yes, but no low hanging fruit. This comment is already way too long so I'll save that for another time :p

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 4, 2025

I ran Valgrind with #884 applied.

It does solve the memory issue, but the instruction count is only 1% lower; apparently there's something else.

========== Performance data for Resolving ==========
    Old instruction count: 9394171800
    New instruction count: 9624124580
    Relative change: +2 %
    Old peak memory usage: 118016496 Bytes
    New peak memory usage: 117458384 Bytes
    Relative change: 0 %

One thing I want to try is re-inlining functions; the old version de-inlined them for some reason (probably to avoid relying on C11). I don't think this would change anything, but it's worth trying to minimise the difference.

@bnoordhuis
Contributor

I checked and it does, but in the wrong direction. No benchmarks improve but some regress.

@bnoordhuis
Contributor

It does solve the memory issue; but instruction count is only 1% lower; apparently there's something else.

bellard/master misses a couple of correctness fixes; it doesn't record the PC (sf->cur_pc = pc) in enough places. quickjs-ng does but that comes at a cost.

@bnoordhuis
Contributor

@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again.

The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction.

WDYT?

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 4, 2025

bellard/master misses a couple of correctness fixes; it doesn't record the PC (sf->cur_pc = pc) in enough places. quickjs-ng does but that comes at a cost.

A couple of ifs or a single assignment should not have this much of an effect. That's why I am sceptical about the inlining I mentioned earlier, but I want to exclude it, just in case.

Based on my experience with the profiler, this looks like extra (re)allocations or a greedy algorithm.

@bnoordhuis
Contributor

A couple of if's or a single assignment should not have this much of an effect.

They do when they're part of a bytecode handler. Some bytecodes execute millions or billions of times over the lifetime of a program.
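
For a concrete picture, here is a minimal sketch of an interpreter dispatch loop (illustrative only; the real quickjs handlers are far more involved): the sf->cur_pc store sits inside the loop, so it is paid once per executed bytecode instruction.

```c
/* Minimal sketch of why one extra store per opcode matters: it lives in
 * the dispatch loop and runs once per executed bytecode instruction.
 * Opcode and frame names are illustrative, not the quickjs ones. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

typedef struct {
    const uint8_t *cur_pc;   /* mirrors sf->cur_pc in the discussion above */
} StackFrame;

static int64_t run(StackFrame *sf, const uint8_t *code, size_t len)
{
    int64_t acc = 0;
    const uint8_t *pc = code;
    while (pc < code + len) {
        sf->cur_pc = pc;     /* the correctness fix: record the PC so errors
                                and backtraces can point at the right bytecode;
                                paid on every single instruction executed */
        switch (*pc++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
    return acc;
}

int main(void)
{
    const uint8_t code[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    StackFrame sf = { NULL };
    printf("%lld\n", (long long)run(&sf, code, sizeof(code)));
    return 0;
}
```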

@saghul
Contributor

saghul commented Feb 4, 2025

@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again.

The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction.

WDYT?

👍 let's do it.

@chqrlie
Collaborator

chqrlie commented Feb 4, 2025

@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again.

The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction.

WDYT?

I agree. There are some ideas for a simpler yet more effective direction worth investigating.

@ammarahm-ed
Contributor

Ran some benchmarks with different variants of quickjs

 ./quickjs-bellard/qjs --stack-size 99999999 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.96 runs/s
         babel:  2.78 runs/s
  babel-minify:  2.82 runs/s
       babylon:  2.03 runs/s
         buble:  3.13 runs/s
          chai:  3.95 runs/s
        espree:  0.63 runs/s
       esprima:  1.25 runs/s
        jshint:  2.24 runs/s
         lebab:  1.96 runs/s
       postcss:  1.62 runs/s
       prepack:  3.08 runs/s
      prettier:  1.26 runs/s
    source-map:  1.49 runs/s
        terser:  6.37 runs/s
    typescript:  4.48 runs/s
     uglify-js:  1.71 runs/s
-------------------------------------
Geometric mean:  2.10 runs/s

./quickjs-ng-main/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.97 runs/s
         babel:  2.80 runs/s
  babel-minify:  2.79 runs/s
       babylon:  1.95 runs/s
         buble:  3.13 runs/s
          chai:  3.85 runs/s
        espree:  0.63 runs/s
       esprima:  1.33 runs/s
        jshint:  2.24 runs/s
         lebab:  2.05 runs/s
       postcss:  1.29 runs/s
       prepack:  2.92 runs/s
      prettier:  1.26 runs/s
    source-map:  1.42 runs/s
        terser:  6.25 runs/s
    typescript:  4.33 runs/s
     uglify-js:  1.20 runs/s
-------------------------------------
Geometric mean:  2.02 runs/s

./quickjs-ng-main-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.26 runs/s
         babel:  3.61 runs/s
  babel-minify:  4.46 runs/s
       babylon:  2.46 runs/s
         buble:  4.46 runs/s
          chai:  6.80 runs/s
        espree:  0.81 runs/s
       esprima:  2.07 runs/s
        jshint:  4.22 runs/s
         lebab:  3.64 runs/s
       postcss:  1.97 runs/s
       prepack:  4.13 runs/s
      prettier:  2.10 runs/s
    source-map:  2.28 runs/s
        terser: 10.53 runs/s
    typescript:  5.26 runs/s
     uglify-js:  3.02 runs/s
-------------------------------------
Geometric mean:  3.12 runs/s

./quickjs-ng-prototype-inline-cache/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.94 runs/s
         babel:  2.70 runs/s
  babel-minify:  2.84 runs/s
       babylon:  1.95 runs/s
         buble:  2.86 runs/s
          chai:  3.76 runs/s
        espree:  0.63 runs/s
       esprima:  1.27 runs/s
        jshint:  2.11 runs/s
         lebab:  1.84 runs/s
       postcss:  1.64 runs/s
       prepack:  2.96 runs/s
      prettier:  1.22 runs/s
    source-map:  1.40 runs/s
        terser:  5.85 runs/s
    typescript:  4.12 runs/s
     uglify-js:  1.14 runs/s
-------------------------------------
Geometric mean:  1.97 runs/s

./quickjs-ng-prototype-inline-cache-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.23 runs/s
         babel:  3.48 runs/s
  babel-minify:  4.31 runs/s
       babylon:  2.45 runs/s
         buble:  4.05 runs/s
          chai:  6.76 runs/s
        espree:  0.77 runs/s
       esprima:  1.98 runs/s
        jshint:  4.00 runs/s
         lebab:  3.47 runs/s
       postcss:  2.37 runs/s
       prepack:  4.07 runs/s
      prettier:  2.04 runs/s
    source-map:  2.24 runs/s
        terser:  9.62 runs/s
    typescript:  4.90 runs/s
     uglify-js:  2.70 runs/s
-------------------------------------
Geometric mean:  3.02 runs/s

./quickjs-preinit-ic/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.96 runs/s
         babel:  2.78 runs/s
  babel-minify:  2.69 runs/s
       babylon:  1.93 runs/s
         buble:  3.03 runs/s
          chai:  3.75 runs/s
        espree:  0.62 runs/s
       esprima:  1.31 runs/s
        jshint:  2.27 runs/s
         lebab:  1.97 runs/s
       postcss:  1.29 runs/s
       prepack:  2.94 runs/s
      prettier:  1.26 runs/s
    source-map:  1.41 runs/s
        terser:  6.41 runs/s
    typescript:  4.35 runs/s
     uglify-js:  1.22 runs/s
-------------------------------------
Geometric mean:  2.00 runs/s

./quickjs-preinit-ic-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.26 runs/s
         babel:  3.60 runs/s
  babel-minify:  4.42 runs/s
       babylon:  2.41 runs/s
         buble:  4.42 runs/s
          chai:  6.71 runs/s
        espree:  0.80 runs/s
       esprima:  2.05 runs/s
        jshint:  4.27 runs/s
         lebab:  3.66 runs/s
       postcss:  1.93 runs/s
       prepack:  4.12 runs/s
      prettier:  2.12 runs/s
    source-map:  2.26 runs/s
        terser: 10.42 runs/s
    typescript:  5.21 runs/s
     uglify-js:  2.99 runs/s
-------------------------------------
Geometric mean:  3.10 runs/s

./quickjs-rm-ic/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.91 runs/s
         babel:  2.76 runs/s
  babel-minify:  2.76 runs/s
       babylon:  1.90 runs/s
         buble:  3.16 runs/s
          chai:  3.77 runs/s
        espree:  0.62 runs/s
       esprima:  1.28 runs/s
        jshint:  2.29 runs/s
         lebab:  1.95 runs/s
       postcss:  1.72 runs/s
       prepack:  3.00 runs/s
      prettier:  1.24 runs/s
    source-map:  1.41 runs/s
        terser:  6.06 runs/s
    typescript:  2.41 runs/s
     uglify-js:  1.65 runs/s
-------------------------------------
Geometric mean:  1.99 runs/s

./quickjs-rm-ic-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.23 runs/s
         babel:  3.55 runs/s
  babel-minify:  4.44 runs/s
       babylon:  2.40 runs/s
         buble:  4.48 runs/s
          chai:  6.76 runs/s
        espree:  0.78 runs/s
       esprima:  1.98 runs/s
        jshint:  4.27 runs/s
         lebab:  3.64 runs/s
       postcss:  2.47 runs/s
       prepack:  4.20 runs/s
      prettier:  2.11 runs/s
    source-map:  2.22 runs/s
        terser: 10.53 runs/s
    typescript:  3.50 runs/s
     uglify-js:  3.01 runs/s
-------------------------------------
Geometric mean:  3.06 runs/s

@ammarahm-ed
Contributor

Some more benchmarks:

Running benchmark for ./quickjs-bellard/qjs...
Box2D: 6317
CodeLoad: 22071
PdfJS: 5683
Typescript: 20323
Mandreel: 2043
MandreelLatency: 14837
Richards: 1640
DeltaBlue: 1269
Crypto: 2100
RayTrace: 1092
EarleyBoyer: 2275
RegExp: 401
Splay: 2035
SplayLatency: 8082
NavierStokes: 4373
----
Score (version 9): 3513
Running benchmark for ./quickjs-ng-main/build/qjs...
Box2D: 6958
CodeLoad: 19877
PdfJS: 5780
Typescript: 21006
Mandreel: 1538
MandreelLatency: 7364
Richards: 1806
DeltaBlue: 1437
Crypto: 1853
RayTrace: 1071
EarleyBoyer: 2244
RegExp: 387
Splay: 2518
SplayLatency: 8640
NavierStokes: 3512
----
Score (version 9): 3318
Running benchmark for ./quickjs-ng-main-mimalloc/build/qjs...
Box2D: 8037
CodeLoad: 32181
PdfJS: 7206
Typescript: 26421
Mandreel: 1526
MandreelLatency: 9699
Richards: 1790
DeltaBlue: 1679
Crypto: 1859
RayTrace: 2635
EarleyBoyer: 4103
RegExp: 524
Splay: 5044
SplayLatency: 13698
NavierStokes: 3488
----
Score (version 9): 4463
Running benchmark for ./quickjs-ng-prototype-inline-cache/build/qjs...
Box2D: 6720
CodeLoad: 19421
PdfJS: 5463
Typescript: 19538
Mandreel: 1477
MandreelLatency: 8870
Richards: 1864
DeltaBlue: 1513
Crypto: 1800
RayTrace: 1121
EarleyBoyer: 2226
RegExp: 380
Splay: 1830
SplayLatency: 6994
NavierStokes: 3376
----
Score (version 9): 3201
Running benchmark for ./quickjs-ng-prototype-inline-cache-mimalloc/build/qjs...
Box2D: 7576
CodeLoad: 31319
PdfJS: 7052
Typescript: 24135
Mandreel: 1479
MandreelLatency: 9569
Richards: 1859
DeltaBlue: 1852
Crypto: 1797
RayTrace: 2524
EarleyBoyer: 3691
RegExp: 518
Splay: 4119
SplayLatency: 13863
NavierStokes: 3383
----
Score (version 9): 4310
Running benchmark for ./quickjs-preinit-ic/build/qjs...
Box2D: 6830
CodeLoad: 19352
PdfJS: 5688
Typescript: 20342
Mandreel: 1525
MandreelLatency: 10652
Richards: 1763
DeltaBlue: 1272
Crypto: 1841
RayTrace: 1099
EarleyBoyer: 2238
RegExp: 392
Splay: 2532
SplayLatency: 8825
NavierStokes: 3425
----
Score (version 9): 3352
Running benchmark for ./quickjs-preinit-ic-mimalloc/build/qjs...
Box2D: 7927
CodeLoad: 31130
PdfJS: 7124
Typescript: 25895
Mandreel: 1523
MandreelLatency: 9756
Richards: 1742
DeltaBlue: 1735
Crypto: 1854
RayTrace: 2659
EarleyBoyer: 4108
RegExp: 514
Splay: 4987
SplayLatency: 13363
NavierStokes: 3451
----
Score (version 9): 4426
Running benchmark for ./quickjs-rm-ic/build/qjs...
Box2D: 6065
CodeLoad: 20605
PdfJS: 5448
Typescript: 19698
Mandreel: 1541
MandreelLatency: 9048
Richards: 1546
DeltaBlue: 1365
Crypto: 1826
RayTrace: 1101
EarleyBoyer: 2286
RegExp: 389
Splay: 2505
SplayLatency: 9418
NavierStokes: 3451
----
Score (version 9): 3291
Running benchmark for ./quickjs-rm-ic-mimalloc/build/qjs...
Box2D: 6884
CodeLoad: 32655
PdfJS: 6821
Typescript: 24558
Mandreel: 1540
MandreelLatency: 11103
Richards: 1557
DeltaBlue: 1510
Crypto: 1845
RayTrace: 2630
EarleyBoyer: 4110
RegExp: 512
Splay: 4922
SplayLatency: 15831
NavierStokes: 3478
----
Score (version 9): 4380

@ahaoboy

ahaoboy commented Feb 5, 2025

I have also noticed that quickjs-ng performance is slower even when compiled with the same toolchain on the same platform.

Since some of the engines being compared only support ES5, only the v8-v7 test suite is used.

https://github.com/ahaoboy/js-engine-benchmark

Both tjs and llrt use quickjs-ng, but their test performance is much better; I don't know why.

Engine        llrt        tjs       qjs         qjs(ng)
Version       0.4.0-beta  24.6.0    2024-02-14  0.8.0
Total size    10.6M       5.1M      4.6M        2.1M
Exe size      10.6M       5.1M      4.6M        2.1M
Dll size      0           0         0           0
Richards      828         681       708         630
DeltaBlue     740         668       685         612
Crypto        690         606       763         381
RayTrace      1210        1113      916         702
EarleyBoyer   2009        1783      1476        1219
RegExp        198         254       228         181
Splay         1734        1956      1718        1115
NavierStokes  1158        1043      1322        971
Score         894         854       845         629

@ammarahm-ed
Contributor

@ahaoboy can you try running the benchmark with the removed-ICs PR?

@ammarahm-ed
Contributor

When poly_ic was added to quickjs-ng, it was indeed much faster than it is today:

./quickjs-ng-poly_ic/build/qjs ./combined.js
Box2D: 6911
CodeLoad: 21546
PdfJS: 5847
Typescript: 21079
Mandreel: 1972
MandreelLatency: 14662
Richards: 1797
DeltaBlue: 1299
Crypto: 2127
RayTrace: 1129
EarleyBoyer: 2286
RegExp: 402
Splay: 1962
SplayLatency: 7099
NavierStokes: 4541
----
Score (version 9): 3541


WITH MIMALLOC
env DYLD_INSERT_LIBRARIES=/opt/homebrew/lib/libmimalloc.dylib ./quickjs-ng-poly_ic/build/qjs ./combined.js
Box2D: 7989
CodeLoad: 33395
PdfJS: 7377
Typescript: 26486
Mandreel: 2003
MandreelLatency: 14926
Richards: 1822
DeltaBlue: 1793
Crypto: 2149
RayTrace: 2440
EarleyBoyer: 3968
RegExp: 538
Splay: 3932
SplayLatency: 13325
NavierStokes: 4515
----
Score (version 9): 4735

So I think the performance regression is probably not due to adding poly_ic to quickjs-ng but instead to some other change made along the way?

@ahaoboy

ahaoboy commented Feb 6, 2025

@ahaoboy can you try running the benchmark with the removed-ICs PR?

I tested it several times, and the scores fluctuated a bit, but the ranking remained basically unchanged.

v8-v7 is a benchmark for ES5, but most modern code may use ES6; maybe the test results for ES6 would be different?

Engine        ./qjs-bellard  ./qjs-rm-ic  ./qjs-ng
Version       0              0            0
Total size    4.6M           1.4M         1.4M
Exe size      4.6M           1.4M         1.4M
Dll size      0              0            0
Richards      863            722          729
DeltaBlue     868            827          849
Crypto        946            481          428
RayTrace      1079           1226         1064
EarleyBoyer   1716           1757         1609
RegExp        226            192          190
Splay         1693           2088         2123
NavierStokes  1568           1056         887
Score         971            846          797

@xeioex
Contributor

xeioex commented Feb 6, 2025

@ahaoboy, @ammarahm-ed

To make a statistically correct comparison that takes uncertainties into account, it is better to collect 10-20 results per binary and feed them to ministat.

bnoordhuis added a commit that referenced this issue Feb 6, 2025
Inline caches are supposed to improve performance but benchmarks show
it's currently a mixed bag: sometimes faster, sometimes slower, always
more memory hungry. After due consideration, I think it's better to
remove the current implementation and start afresh.

Refs: #876
@ammarahm-ed
Contributor

@xeioex I ran it many times, the difference is consistently there.

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 9, 2025

Valgrind results on top of 6b78c7f (IC added):
new/old instructions = 1.0083, so the difference is negligible.

========== Performance data for Resolving ==========
    Old instruction count: 9414388368
    New instruction count: 9492696521
    Relative change: 0 %
    Old peak memory usage: 117179024 Bytes
    New peak memory usage: 121025176 Bytes
    Relative change: +3 %

Valgrind results on top of 5c3077e (commit before IC added):

new/old instructions = 0.9932, so IC indeed added some overhead, but it's not that big.

========== Performance data for Resolving ==========
    Old instruction count: 9387259784
    New instruction count: 9323831689
    Relative change: 0 %
    Old peak memory usage: 118016408 Bytes
    New peak memory usage: 117787240 Bytes
    Relative change: 0 %

This proves that the instruction count regression was introduced later.
The memory regression is indeed due to IC, but I don't think this is a big problem for us, especially if the feature can be disabled in the future.
I'll try pinpointing the exact commit.

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 9, 2025

Can it be bace4f6?
Bisect points to it unless I did something wrong.

@bnoordhuis
Contributor

What benchmark or benchmarks did you use? I can see that commit slowing down parsing but not runtime performance. A suite like web-tooling-benchmark should be largely unaffected.

@chqrlie
Collaborator

chqrlie commented Feb 9, 2025

Adding column number bytecodes might affect the peephole optimizer...

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 9, 2025

What benchmark or benchmarks did you use? I can see that commit slowing down parsing but not runtime performance. A suite like web-tooling-benchmark should be largely unaffected.

Yes, we evaluate small pieces of JS code and parse them all the time.
https://github.com/qbs/qbs/blob/master/src/lib/corelib/language/scriptengine.cpp#L841-L845
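
Roughly, the pattern is many calls like this; a minimal sketch against the public QuickJS C API (JS_Eval), not the actual qbs code, just to show that each tiny snippet is parsed and compiled from scratch:

```c
/* Minimal sketch (not the actual qbs code) of the "evaluate many tiny
 * scripts" pattern, using the public QuickJS C API. Every JS_Eval call
 * parses, compiles and runs its snippet, so any per-parse overhead is
 * paid once per snippet. */
#include <string.h>
#include "quickjs.h"

int main(void)
{
    JSRuntime *rt = JS_NewRuntime();
    JSContext *ctx = JS_NewContext(rt);

    for (int i = 0; i < 100000; i++) {
        const char *src = "1 + 2 * 3";   /* tiny throwaway expression */
        JSValue v = JS_Eval(ctx, src, strlen(src), "<expr>",
                            JS_EVAL_TYPE_GLOBAL);
        if (JS_IsException(v)) {
            JS_FreeValue(ctx, v);
            break;
        }
        JS_FreeValue(ctx, v);            /* result discarded, parse cost kept */
    }

    JS_FreeContext(ctx);
    JS_FreeRuntime(rt);
    return 0;
}
```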

@bnoordhuis
Contributor

Adding column number bytecodes might affect the peephole optimizer...

I don't think it's that. There's no separate opcode for column numbers.

I replaced OP_line_num with an opcode that stores both line and column. resolve_labels and code_match filter those out, same as before.
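
For illustration only, packing both numbers into a single payload could look something like this (a hypothetical encoding, not the actual quickjs-ng one):

```c
/* Hypothetical encoding, not the actual quickjs-ng one: pack a line and a
 * (clamped) column into a single 32-bit payload carried by one debug opcode. */
#include <stdint.h>
#include <stdio.h>

#define COLUMN_BITS 10
#define COLUMN_MAX  ((1u << COLUMN_BITS) - 1)

static uint32_t pack_loc(uint32_t line, uint32_t column)
{
    if (column > COLUMN_MAX)
        column = COLUMN_MAX;            /* clamp very long lines */
    return (line << COLUMN_BITS) | column;
}

static void unpack_loc(uint32_t loc, uint32_t *line, uint32_t *column)
{
    *line = loc >> COLUMN_BITS;
    *column = loc & COLUMN_MAX;
}

int main(void)
{
    uint32_t line, col;
    unpack_loc(pack_loc(120, 37), &line, &col);
    printf("line %u column %u\n", (unsigned)line, (unsigned)col);
    return 0;
}
```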

Yes, we evaluate small pieces of JS code and parse them all the time.

Right, I can see how that would regress performance because the parser has to do more work now. I consider that a correctness fix though so don't expect it to be rolled back. However, maybe it's possible to make it faster.

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 10, 2025

I double-checked and yes, bace4f6 introduces the regression.

Can you please speed it up a bit?

@saghul
Contributor

saghul commented Feb 10, 2025

Aren't we talking ~1% perf difference here?

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 10, 2025

9761057471 / 9396116795 = 1.0388, so closer to 4%.

But yeah, having columns is a nice feature, and if it is hard or impossible to speed up, we will have to live with that slowdown.
