NG is slower compared to Bellard version #876

Open
ABBAPOH opened this issue Feb 2, 2025 · 38 comments

Comments

@ABBAPOH
Contributor

ABBAPOH commented Feb 2, 2025

So, I ran Valgrind with the two engines and it seems that NG is about 3% slower and also consumes more memory:

========== Performance data for Resolving ==========
    Old instruction count: 9396116795
    New instruction count: 9761057471
    Relative change: +3 %
    Old peak memory usage: 118112672 Bytes
    New peak memory usage: 120718720 Bytes
    Relative change: +2 %

Results seem to be consistent between different runs.

However, attempting to use a sampling profiler on macOS does not provide consistent results, making it hard to track down the reason for the degradation.
Some suspects are (the numbers are sample counts in 3 different runs):
Old
js_create_function 32/28/42
__JS_EvalInternal 87/73/81

New
js_create_function 39/36/39
__JS_EvalInternal 105/84/83

It seems that both functions are a bit slower, but that could be natural variance between runs.
Bisecting this would be harder this time, but if there's no other way, I can try doing so.

In the meantime, maybe you have any hints what could cause regression?

@saghul
Contributor

saghul commented Feb 2, 2025

The memory consumption could be explained by the poly IC, and the numbers you showed are what, ms? They look reasonably close to the margin of error?

What benchmark did you run?

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 2, 2025

The bottom numbers are "sample counts" - the number of samples the profiler took while the stack pointer was in that particular function. But yeah, it seems to take a sample every 1 ms.
It does seem to be random; that's why I am not sure this is related to eval/js_create_function.
OTOH the overall instruction count is way bigger according to Valgrind - even 1% is concerning.

I ran our build system on itself - the typical benchmark we do when making changes to the language. There's a lot of JS code involved, so it's hard to pinpoint the exact place.

@bnoordhuis
Contributor

Are both binaries built with the (exact!) same compiler and compiler flags? Even flag order can matter.

Does your benchmark contain a lot of code that runs only once? The IC infrastructure is pure overhead for such code at the moment; you pay for the bookkeeping without any payoff. It's on my todo list to introduce an IC preinit mode that delays IC creation until the second or third time the bytecode is executed.
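
Roughly, the idea looks like this (a minimal sketch; all names here are hypothetical, not the quickjs-ng API): the cache for a property-access site is only materialised once that site has executed a couple of times, so run-once code pays nothing beyond a counter increment.

```c
/* Hypothetical sketch of a delayed ("preinit") inline cache.
 * Not the quickjs-ng implementation; names are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define IC_WARMUP_THRESHOLD 2           /* create the IC on the 3rd execution */

typedef struct { uint32_t cached; } InlineCache;   /* placeholder cache */

typedef struct {
    uint8_t hit_count;                  /* executions seen while still cold */
    InlineCache *ic;                    /* NULL until the site is warm */
} ICSite;

static uint32_t slow_lookup(void) { return 7; }    /* stand-in hash-table lookup */

static uint32_t get_property(ICSite *site)
{
    if (!site->ic) {
        if (site->hit_count < IC_WARMUP_THRESHOLD) {
            site->hit_count++;          /* cold path: no IC bookkeeping at all */
            return slow_lookup();
        }
        site->ic = calloc(1, sizeof(*site->ic));   /* warm: materialise the IC */
        if (!site->ic)
            return slow_lookup();
        site->ic->cached = slow_lookup();
    }
    return site->ic->cached;            /* fast path once warmed up */
}

int main(void)
{
    ICSite site = {0};
    for (int i = 0; i < 5; i++)         /* the first two runs stay on the slow path */
        printf("run %d -> %u (IC %s)\n", i, get_property(&site),
               site.ic ? "created" : "not yet created");
    free(site.ic);
    return 0;
}
```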

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 3, 2025

Are both binaries built with the (exact!) same compiler and compiler flags? Even flag order can matter.

Yes, the only difference was that the new code uses C11 instead of C99, but I excluded that by compiling the old version with C11 too.

Does your benchmark contain a lot of code that runs only once?

Can be, yes.
Can you please give the SHA of the commit that introduced the IC stuff so I can try benchmarking before it was introduced? Otherwise I will do a full bisect search.
Or can I disable the feature somehow?

@saghul
Contributor

saghul commented Feb 3, 2025

IC was introduced quite a while ago, in v0.2.0: #120

There is no way to disable it, currently.

@bnoordhuis
Contributor

bnoordhuis commented Feb 3, 2025

See #883 and #884. I ran some benchmarks where I, ahem, surgically removed ICs and here are the results for web-tooling-benchmark (>1.0 means ICs are faster, <1.0 means slower):

       acorn: 1.024
       babel: 0.9386
babel-minify: 1.419
     babylon: 0.9375
       buble: 0.9521
        chai: 0.9507
      espree: 1.033
     esprima: 1.055
      jshint: 0.9683
       lebab: 0.9649
     postcss: 0.9714
     prepack: 0.9403
    prettier: 1.543
  source-map: 1.061
      terser: 1.0
  typescript: 0.8935
   uglify-js: 0.9759
=====================
        mean: 1.0367

babel-minify and prettier benefit a great deal, but for everything else it's either a minor win or slower; a good deal slower in the case of the typescript compiler benchmark.

The mean suggests they're still a net win on average, or at least, a net win when running benchmarks 🙈

I don't know if that means keep 'em, chuck 'em, or some third option. We're trading memory for CPU. Which is more important?

I also experimented with pre-init ICs. They regress the babel benchmark 5x (which I expected, albeit not that dramatically) but otherwise don't move the needle much, which makes sense because these are benchmarks; they don't contain much code that only runs once.

@saghul
Contributor

saghul commented Feb 4, 2025

I don't think I'm qualified enough to decide, so I'll ask some questions :-)

  • Is there something left on the table that would make the IC work "shine" more?
  • Does Add prototype inline cache #551 move the needle in any direction?
  • Intuitively I guess we want the best of both (of course): CPU and memory usage
    • CPU in cases like lambda functions
    • Memory in constrained environments like IoT?
  • IIRC all "big" engines have this. What does it mean? Do they rely on JIT to make the performance leap?
  • Would it be hard to have a way to tune it, so those constrained environments could tweak it to their needs? Does that even make sense?

@ammarahm-ed
Contributor

ammarahm-ed commented Feb 4, 2025

@saghul I have a feeling the benchmarks give different results based on what arch you are on, arm64 or x86. Before quickjs-ng came into existence I ran some benchmarks with IC/proto IC and noticed it made performance a lot better. But on my arm64 M4 MacBook today, the difference shrinks. So I think it depends on where you are running quickjs?

On an average system, IC will make a big difference compared to No IC.

I can share benchmarks of different qjs variants here if that helps.

@saghul
Contributor

saghul commented Feb 4, 2025

Good point about architecture!

@gschwind
Contributor

gschwind commented Feb 4, 2025

Hello,

It looks like we can remove it; maybe the inline cache could be made optional with a compile-time option?

Best regards

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 4, 2025

See #883 and #884

Thanks, I'll try these!

In the meantime, I tried applying our changes on top of #120, but the recent version of the changes gives too many conflicts, so I'd need to try a different approach.

@saghul
Contributor

saghul commented Feb 4, 2025

In the meantime, I tried applying our changes on top of #120, but the recent version of the changes gives too many conflicts, so I'd need to try a different approach.

Applying #884 would be equivalent to the state of the repo pre-#120, but with all current changes.

@bnoordhuis
Contributor

For the record: the benchmark numbers are from an i7 Intel system.

IIRC all "big" engines have this. What does it mean? Do they rely on JIT to make the performance leap?

V8/SM/JSC:

  1. use ICs to collect type feedback for their optimizing tiers; that's more important to them than optimizing property access at the lowest tier because hot code doesn't stay at that tier long

  2. have {mono,poly,mega}morphic ICs whereas we only have mono/poly

For property accesses with a degree of polymorphism > 4, we scan the IC cache and perform a hash table lookup in case of a cache miss.

Megamorphic ICs are an end state that says "too much polymorphism so don't bother and do the hash table lookup right away."

They're a potential performance trap though, because it's not uncommon for long-living code to have lots of polymorphism at first, then stabilize once the program reaches a steady state.

I think most optimizing JS engines have second-chance strategies to mitigate that but fine-tuning that is finicky Goldilocks stuff: don't want to flush too often, don't want to wait too long, has to be just right.

Would it be hard to have a way to tune it, so those contrained environments could tweak it to their needs?

At the moment the only thing you could tweak is the number of cache slots, currently 4. Lower that to 3 and the size of an IC on 64-bit systems drops from 56 to 44 bytes, 32 bytes when lowered to 2.

I just tried with N=3. Most benchmarks are unaffected but babel-minify and typescript slow down by 40% and 46% respectively.
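
For illustration, a rough sketch of this kind of fixed-slot polymorphic IC with the hash-table fallback (names and layout are made up, not the actual quickjs-ng definitions; each slot costs roughly a shape pointer plus a 32-bit offset, in line with the figures above):

```c
/* Rough sketch of a fixed-slot polymorphic inline cache with a
 * hash-table fallback on a miss. Names and layout are illustrative,
 * not the actual quickjs-ng definitions. */
#include <stdint.h>
#include <stdio.h>

#define IC_SLOT_COUNT 4                 /* the tunable discussed above */

typedef struct { int id; } Shape;       /* stand-in for an object shape */

typedef struct {
    uint32_t prop_atom;                 /* property name for this site */
    uint32_t used;                      /* filled slots, <= IC_SLOT_COUNT */
    Shape *shape[IC_SLOT_COUNT];        /* cached shape per slot */
    uint32_t offset[IC_SLOT_COUNT];     /* property offset for that shape */
} PolyIC;

/* Stand-in for the generic lookup; in the real engine this walks a hash table. */
static uint32_t hash_table_lookup(const Shape *s, uint32_t atom)
{
    return (uint32_t)s->id + atom;      /* dummy result for the sketch */
}

/* Linear scan of the slots; fill a free slot on a miss. Once every slot is
 * taken, new shapes always fall through to the hash table - the point where
 * a megamorphic tier would normally take over. */
static uint32_t ic_get(PolyIC *ic, Shape *s)
{
    for (uint32_t i = 0; i < ic->used; i++)
        if (ic->shape[i] == s)
            return ic->offset[i];       /* cache hit */
    uint32_t off = hash_table_lookup(s, ic->prop_atom);
    if (ic->used < IC_SLOT_COUNT) {     /* remember this shape if there is room */
        ic->shape[ic->used] = s;
        ic->offset[ic->used] = off;
        ic->used++;
    }
    return off;
}

int main(void)
{
    PolyIC ic = { .prop_atom = 42 };
    Shape a = { 1 }, b = { 2 };
    printf("%u %u %u\n", (unsigned)ic_get(&ic, &a), (unsigned)ic_get(&ic, &b),
           (unsigned)ic_get(&ic, &a));
    return 0;
}
```

Lowering IC_SLOT_COUNT saves memory per cache but lets more sites overflow into the hash-table path, which is presumably what the babel-minify and typescript slowdowns at N=3 reflect.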

Is there something left on the table that would make the IC work "shine" more?

Yes, but no low hanging fruit. This comment is already way too long so I'll save that for another time :p

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 4, 2025

I ran Valgrind with #884 applied.

It does solve the memory issue, but the instruction count is only 1% lower; apparently there's something else.

========== Performance data for Resolving ==========
    Old instruction count: 9394171800
    New instruction count: 9624124580
    Relative change: +2 %
    Old peak memory usage: 118016496 Bytes
    New peak memory usage: 117458384 Bytes
    Relative change: 0 %

One thing I want to try is re-inlining functions; the old version de-inlined them for some reason (probably to avoid relying on C11). I don't think this would change anything, but it's worth trying to minimise the difference.

@bnoordhuis
Contributor

I checked and it does, but in the wrong direction. No benchmarks improve but some regress.

@bnoordhuis
Contributor

It does solve the memory issue; but instruction count is only 1% lower; apparently there's something else.

bellard/master misses a couple of correctness fixes; it doesn't record the PC (sf->cur_pc = pc) in enough places. quickjs-ng does but that comes at a cost.

@bnoordhuis
Contributor

@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again.

The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction.

WDYT?

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 4, 2025

bellard/master misses a couple of correctness fixes; it doesn't record the PC (sf->cur_pc = pc) in enough places. quickjs-ng does but that comes at a cost.

A couple of ifs or a single assignment should not have this much of an effect. That's why I am sceptical about the inlining I mentioned earlier, but I want to exclude it, just in case.

Based on my experience with the profiler, this looks like extra (re)allocations or a greedy algorithm.

@bnoordhuis
Contributor

A couple of if's or a single assignment should not have this much of an effect.

They do when they're part of a bytecode handler. Some bytecodes execute millions or billions of times over the lifetime of a program.
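
For a concrete picture, here is a minimal sketch of an interpreter dispatch loop (illustrative only; the real quickjs handlers are far more involved): the sf->cur_pc store sits inside the loop, so it is paid once per executed bytecode instruction.

```c
/* Minimal sketch of why one extra store per opcode matters: it lives in
 * the dispatch loop and runs once per executed bytecode instruction.
 * Opcode and frame names are illustrative, not the quickjs ones. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

typedef struct {
    const uint8_t *cur_pc;   /* mirrors sf->cur_pc in the discussion above */
} StackFrame;

static int64_t run(StackFrame *sf, const uint8_t *code, size_t len)
{
    int64_t acc = 0;
    const uint8_t *pc = code;
    while (pc < code + len) {
        sf->cur_pc = pc;     /* the correctness fix: record the PC so errors
                                and backtraces can point at the right bytecode;
                                paid on every single instruction executed */
        switch (*pc++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
    return acc;
}

int main(void)
{
    const uint8_t code[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    StackFrame sf = { NULL };
    printf("%lld\n", (long long)run(&sf, code, sizeof(code)));
    return 0;
}
```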

@saghul
Contributor

saghul commented Feb 4, 2025

@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again.

The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction.

WDYT?

👍 let's do it.

@chqrlie
Collaborator

chqrlie commented Feb 4, 2025

@saghul after mulling it over for a bit, I think I'm in favor of removing ICs and starting from first principles again.

The CPU/memory trade-off isn't quite there with the current implementation and that implementation is also kind of a straitjacket if you want to go in a completely different direction.

WDYT?

I agree. There are some ideas for a simpler yet more effective direction worth investigating.

@ammarahm-ed
Contributor

Ran some benchmarks with different variants of quickjs

 ./quickjs-bellard/qjs --stack-size 99999999 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.96 runs/s
         babel:  2.78 runs/s
  babel-minify:  2.82 runs/s
       babylon:  2.03 runs/s
         buble:  3.13 runs/s
          chai:  3.95 runs/s
        espree:  0.63 runs/s
       esprima:  1.25 runs/s
        jshint:  2.24 runs/s
         lebab:  1.96 runs/s
       postcss:  1.62 runs/s
       prepack:  3.08 runs/s
      prettier:  1.26 runs/s
    source-map:  1.49 runs/s
        terser:  6.37 runs/s
    typescript:  4.48 runs/s
     uglify-js:  1.71 runs/s
-------------------------------------
Geometric mean:  2.10 runs/s

./quickjs-ng-main/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.97 runs/s
         babel:  2.80 runs/s
  babel-minify:  2.79 runs/s
       babylon:  1.95 runs/s
         buble:  3.13 runs/s
          chai:  3.85 runs/s
        espree:  0.63 runs/s
       esprima:  1.33 runs/s
        jshint:  2.24 runs/s
         lebab:  2.05 runs/s
       postcss:  1.29 runs/s
       prepack:  2.92 runs/s
      prettier:  1.26 runs/s
    source-map:  1.42 runs/s
        terser:  6.25 runs/s
    typescript:  4.33 runs/s
     uglify-js:  1.20 runs/s
-------------------------------------
Geometric mean:  2.02 runs/s

./quickjs-ng-main-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.26 runs/s
         babel:  3.61 runs/s
  babel-minify:  4.46 runs/s
       babylon:  2.46 runs/s
         buble:  4.46 runs/s
          chai:  6.80 runs/s
        espree:  0.81 runs/s
       esprima:  2.07 runs/s
        jshint:  4.22 runs/s
         lebab:  3.64 runs/s
       postcss:  1.97 runs/s
       prepack:  4.13 runs/s
      prettier:  2.10 runs/s
    source-map:  2.28 runs/s
        terser: 10.53 runs/s
    typescript:  5.26 runs/s
     uglify-js:  3.02 runs/s
-------------------------------------
Geometric mean:  3.12 runs/s

./quickjs-ng-prototype-inline-cache/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.94 runs/s
         babel:  2.70 runs/s
  babel-minify:  2.84 runs/s
       babylon:  1.95 runs/s
         buble:  2.86 runs/s
          chai:  3.76 runs/s
        espree:  0.63 runs/s
       esprima:  1.27 runs/s
        jshint:  2.11 runs/s
         lebab:  1.84 runs/s
       postcss:  1.64 runs/s
       prepack:  2.96 runs/s
      prettier:  1.22 runs/s
    source-map:  1.40 runs/s
        terser:  5.85 runs/s
    typescript:  4.12 runs/s
     uglify-js:  1.14 runs/s
-------------------------------------
Geometric mean:  1.97 runs/s

./quickjs-ng-prototype-inline-cache-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.23 runs/s
         babel:  3.48 runs/s
  babel-minify:  4.31 runs/s
       babylon:  2.45 runs/s
         buble:  4.05 runs/s
          chai:  6.76 runs/s
        espree:  0.77 runs/s
       esprima:  1.98 runs/s
        jshint:  4.00 runs/s
         lebab:  3.47 runs/s
       postcss:  2.37 runs/s
       prepack:  4.07 runs/s
      prettier:  2.04 runs/s
    source-map:  2.24 runs/s
        terser:  9.62 runs/s
    typescript:  4.90 runs/s
     uglify-js:  2.70 runs/s
-------------------------------------
Geometric mean:  3.02 runs/s

./quickjs-preinit-ic/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.96 runs/s
         babel:  2.78 runs/s
  babel-minify:  2.69 runs/s
       babylon:  1.93 runs/s
         buble:  3.03 runs/s
          chai:  3.75 runs/s
        espree:  0.62 runs/s
       esprima:  1.31 runs/s
        jshint:  2.27 runs/s
         lebab:  1.97 runs/s
       postcss:  1.29 runs/s
       prepack:  2.94 runs/s
      prettier:  1.26 runs/s
    source-map:  1.41 runs/s
        terser:  6.41 runs/s
    typescript:  4.35 runs/s
     uglify-js:  1.22 runs/s
-------------------------------------
Geometric mean:  2.00 runs/s

./quickjs-preinit-ic-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.26 runs/s
         babel:  3.60 runs/s
  babel-minify:  4.42 runs/s
       babylon:  2.41 runs/s
         buble:  4.42 runs/s
          chai:  6.71 runs/s
        espree:  0.80 runs/s
       esprima:  2.05 runs/s
        jshint:  4.27 runs/s
         lebab:  3.66 runs/s
       postcss:  1.93 runs/s
       prepack:  4.12 runs/s
      prettier:  2.12 runs/s
    source-map:  2.26 runs/s
        terser: 10.42 runs/s
    typescript:  5.21 runs/s
     uglify-js:  2.99 runs/s
-------------------------------------
Geometric mean:  3.10 runs/s

./quickjs-rm-ic/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  0.91 runs/s
         babel:  2.76 runs/s
  babel-minify:  2.76 runs/s
       babylon:  1.90 runs/s
         buble:  3.16 runs/s
          chai:  3.77 runs/s
        espree:  0.62 runs/s
       esprima:  1.28 runs/s
        jshint:  2.29 runs/s
         lebab:  1.95 runs/s
       postcss:  1.72 runs/s
       prepack:  3.00 runs/s
      prettier:  1.24 runs/s
    source-map:  1.41 runs/s
        terser:  6.06 runs/s
    typescript:  2.41 runs/s
     uglify-js:  1.65 runs/s
-------------------------------------
Geometric mean:  1.99 runs/s

./quickjs-rm-ic-mimalloc/build/qjs --stack-size 2048 --script dist/cli.js
Running Web Tooling Benchmark v0.5.3…
-------------------------------------
         acorn:  1.23 runs/s
         babel:  3.55 runs/s
  babel-minify:  4.44 runs/s
       babylon:  2.40 runs/s
         buble:  4.48 runs/s
          chai:  6.76 runs/s
        espree:  0.78 runs/s
       esprima:  1.98 runs/s
        jshint:  4.27 runs/s
         lebab:  3.64 runs/s
       postcss:  2.47 runs/s
       prepack:  4.20 runs/s
      prettier:  2.11 runs/s
    source-map:  2.22 runs/s
        terser: 10.53 runs/s
    typescript:  3.50 runs/s
     uglify-js:  3.01 runs/s
-------------------------------------
Geometric mean:  3.06 runs/s

@ammarahm-ed
Contributor

Some more benchmarks:

Running benchmark for ./quickjs-bellard/qjs...
Box2D: 6317
CodeLoad: 22071
PdfJS: 5683
Typescript: 20323
Mandreel: 2043
MandreelLatency: 14837
Richards: 1640
DeltaBlue: 1269
Crypto: 2100
RayTrace: 1092
EarleyBoyer: 2275
RegExp: 401
Splay: 2035
SplayLatency: 8082
NavierStokes: 4373
----
Score (version 9): 3513
Running benchmark for ./quickjs-ng-main/build/qjs...
Box2D: 6958
CodeLoad: 19877
PdfJS: 5780
Typescript: 21006
Mandreel: 1538
MandreelLatency: 7364
Richards: 1806
DeltaBlue: 1437
Crypto: 1853
RayTrace: 1071
EarleyBoyer: 2244
RegExp: 387
Splay: 2518
SplayLatency: 8640
NavierStokes: 3512
----
Score (version 9): 3318
Running benchmark for ./quickjs-ng-main-mimalloc/build/qjs...
Box2D: 8037
CodeLoad: 32181
PdfJS: 7206
Typescript: 26421
Mandreel: 1526
MandreelLatency: 9699
Richards: 1790
DeltaBlue: 1679
Crypto: 1859
RayTrace: 2635
EarleyBoyer: 4103
RegExp: 524
Splay: 5044
SplayLatency: 13698
NavierStokes: 3488
----
Score (version 9): 4463
Running benchmark for ./quickjs-ng-prototype-inline-cache/build/qjs...
Box2D: 6720
CodeLoad: 19421
PdfJS: 5463
Typescript: 19538
Mandreel: 1477
MandreelLatency: 8870
Richards: 1864
DeltaBlue: 1513
Crypto: 1800
RayTrace: 1121
EarleyBoyer: 2226
RegExp: 380
Splay: 1830
SplayLatency: 6994
NavierStokes: 3376
----
Score (version 9): 3201
Running benchmark for ./quickjs-ng-prototype-inline-cache-mimalloc/build/qjs...
Box2D: 7576
CodeLoad: 31319
PdfJS: 7052
Typescript: 24135
Mandreel: 1479
MandreelLatency: 9569
Richards: 1859
DeltaBlue: 1852
Crypto: 1797
RayTrace: 2524
EarleyBoyer: 3691
RegExp: 518
Splay: 4119
SplayLatency: 13863
NavierStokes: 3383
----
Score (version 9): 4310
Running benchmark for ./quickjs-preinit-ic/build/qjs...
Box2D: 6830
CodeLoad: 19352
PdfJS: 5688
Typescript: 20342
Mandreel: 1525
MandreelLatency: 10652
Richards: 1763
DeltaBlue: 1272
Crypto: 1841
RayTrace: 1099
EarleyBoyer: 2238
RegExp: 392
Splay: 2532
SplayLatency: 8825
NavierStokes: 3425
----
Score (version 9): 3352
Running benchmark for ./quickjs-preinit-ic-mimalloc/build/qjs...
Box2D: 7927
CodeLoad: 31130
PdfJS: 7124
Typescript: 25895
Mandreel: 1523
MandreelLatency: 9756
Richards: 1742
DeltaBlue: 1735
Crypto: 1854
RayTrace: 2659
EarleyBoyer: 4108
RegExp: 514
Splay: 4987
SplayLatency: 13363
NavierStokes: 3451
----
Score (version 9): 4426
Running benchmark for ./quickjs-rm-ic/build/qjs...
Box2D: 6065
CodeLoad: 20605
PdfJS: 5448
Typescript: 19698
Mandreel: 1541
MandreelLatency: 9048
Richards: 1546
DeltaBlue: 1365
Crypto: 1826
RayTrace: 1101
EarleyBoyer: 2286
RegExp: 389
Splay: 2505
SplayLatency: 9418
NavierStokes: 3451
----
Score (version 9): 3291
Running benchmark for ./quickjs-rm-ic-mimalloc/build/qjs...
Box2D: 6884
CodeLoad: 32655
PdfJS: 6821
Typescript: 24558
Mandreel: 1540
MandreelLatency: 11103
Richards: 1557
DeltaBlue: 1510
Crypto: 1845
RayTrace: 2630
EarleyBoyer: 4110
RegExp: 512
Splay: 4922
SplayLatency: 15831
NavierStokes: 3478
----
Score (version 9): 4380

@ahaoboy

ahaoboy commented Feb 5, 2025

I have also noticed that quickjs-ng performance is slower even when compiled with the same toolchain on the same platform.

Since some of the engines being compared only support ES5, only the v8-v7 test suite is used.

https://github.com/ahaoboy/js-engine-benchmark

Both tjs and llrt use quickjs-ng, but their test performance is much better; I don't know why.

Engine        llrt        tjs       qjs         qjs(ng)
Version       0.4.0-beta  24.6.0    2024-02-14  0.8.0
Total size    10.6M       5.1M      4.6M        2.1M
Exe size      10.6M       5.1M      4.6M        2.1M
Dll size      0           0         0           0
Richards      828         681       708         630
DeltaBlue     740         668       685         612
Crypto        690         606       763         381
RayTrace      1210        1113      916         702
EarleyBoyer   2009        1783      1476        1219
RegExp        198         254       228         181
Splay         1734        1956      1718        1115
NavierStokes  1158        1043      1322        971
Score         894         854       845         629

@ammarahm-ed
Contributor

@ahaoboy can you try running the benchmark with the removed-ICs PR?

@ammarahm-ed
Contributor

When poly_ic was added to quickjs-ng, it was indeed much faster than it is today:

./quickjs-ng-poly_ic/build/qjs ./combined.js
Box2D: 6911
CodeLoad: 21546
PdfJS: 5847
Typescript: 21079
Mandreel: 1972
MandreelLatency: 14662
Richards: 1797
DeltaBlue: 1299
Crypto: 2127
RayTrace: 1129
EarleyBoyer: 2286
RegExp: 402
Splay: 1962
SplayLatency: 7099
NavierStokes: 4541
----
Score (version 9): 3541


WITH MIMALLOC
env DYLD_INSERT_LIBRARIES=/opt/homebrew/lib/libmimalloc.dylib ./quickjs-ng-poly_ic/build/qjs ./combined.js
Box2D: 7989
CodeLoad: 33395
PdfJS: 7377
Typescript: 26486
Mandreel: 2003
MandreelLatency: 14926
Richards: 1822
DeltaBlue: 1793
Crypto: 2149
RayTrace: 2440
EarleyBoyer: 3968
RegExp: 538
Splay: 3932
SplayLatency: 13325
NavierStokes: 4515
----
Score (version 9): 4735

So I think the performance regression is probably not due to adding poly_ic to quickjs-ng but instead to some other change made along the way?

@ahaoboy

ahaoboy commented Feb 6, 2025

@ahaoboy can you try running the benchmark with the removed-ICs PR?

I tested it several times, and the scores fluctuated a bit, but the ranking remained basically unchanged.

v8-v7 is a benchmark for ES5, but most modern code may use ES6; maybe the test results for ES6 would be different?

Engine        ./qjs-bellard  ./qjs-rm-ic  ./qjs-ng
Version       0              0            0
Total size    4.6M           1.4M         1.4M
Exe size      4.6M           1.4M         1.4M
Dll size      0              0            0
Richards      863            722          729
DeltaBlue     868            827          849
Crypto        946            481          428
RayTrace      1079           1226         1064
EarleyBoyer   1716           1757         1609
RegExp        226            192          190
Splay         1693           2088         2123
NavierStokes  1568           1056         887
Score         971            846          797

@xeioex
Contributor

xeioex commented Feb 6, 2025

@ahaoboy, @ammarahm-ed

To make a statistically correct comparison that takes uncertainties into account, it is better to collect 10-20 results per binary and feed them to ministat.

bnoordhuis added a commit that referenced this issue Feb 6, 2025
Inline caches are supposed to improve performance but benchmarks show
it's currently a mixed bag: sometimes faster, sometimes slower, always
more memory hungry. After due consideration, I think it's better to
remove the current implementation and start afresh.

Refs: #876
@ammarahm-ed
Contributor

@xeioex I ran it many times, the difference is consistently there.

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 9, 2025

Valgrind results on top of 6b78c7f (IC added):
new/old instructions = 1.0083, so the difference is negligible.

========== Performance data for Resolving ==========
    Old instruction count: 9414388368
    New instruction count: 9492696521
    Relative change: 0 %
    Old peak memory usage: 117179024 Bytes
    New peak memory usage: 121025176 Bytes
    Relative change: +3 %

Valgrind results on top of 5c3077e (commit before IC added):

new/old instructions = 0.9932, so IC indeed added some overhead, but it's not that big.

========== Performance data for Resolving ==========
    Old instruction count: 9387259784
    New instruction count: 9323831689
    Relative change: 0 %
    Old peak memory usage: 118016408 Bytes
    New peak memory usage: 117787240 Bytes
    Relative change: 0 %

This proves that the instruction count regression was introduced later.
The memory regression is indeed due to IC, but I don't think this is a big problem for us, especially if the feature can be disabled in the future.
I'll try pinpointing the exact commit.

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 9, 2025

Can it be bace4f6?
Bisect points to it unless I did something wrong.

@bnoordhuis
Contributor

What benchmark or benchmarks did you use? I can see that commit slowing down parsing but not runtime performance. A suite like web-tooling-benchmark should be largely unaffected.

@chqrlie
Collaborator

chqrlie commented Feb 9, 2025

Adding column number bytecodes might affect the peephole optimizer...

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 9, 2025

What benchmark or benchmarks did you use? I can see that commit slowing down parsing but not runtime performance. A suite like web-tooling-benchmark should be largely unaffected.

Yes, we evaluate small pieces of JS code and parse them all the time.
https://github.com/qbs/qbs/blob/master/src/lib/corelib/language/scriptengine.cpp#L841-L845
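
Roughly, the pattern is many calls like this; a minimal sketch against the public QuickJS C API (JS_Eval), not the actual qbs code, just to show that each tiny snippet is parsed and compiled from scratch:

```c
/* Minimal sketch (not the actual qbs code) of the "evaluate many tiny
 * scripts" pattern, using the public QuickJS C API. Every JS_Eval call
 * parses, compiles and runs its snippet, so any per-parse overhead is
 * paid once per snippet. */
#include <string.h>
#include "quickjs.h"

int main(void)
{
    JSRuntime *rt = JS_NewRuntime();
    JSContext *ctx = JS_NewContext(rt);

    for (int i = 0; i < 100000; i++) {
        const char *src = "1 + 2 * 3";   /* tiny throwaway expression */
        JSValue v = JS_Eval(ctx, src, strlen(src), "<expr>",
                            JS_EVAL_TYPE_GLOBAL);
        if (JS_IsException(v)) {
            JS_FreeValue(ctx, v);
            break;
        }
        JS_FreeValue(ctx, v);            /* result discarded, parse cost kept */
    }

    JS_FreeContext(ctx);
    JS_FreeRuntime(rt);
    return 0;
}
```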

@bnoordhuis
Contributor

Adding column number bytecodes might affect the peephole optimizer...

I don't think it's that. There's no separate opcode for column numbers.

I replaced OP_line_num with an opcode that stores both line and column. resolve_labels and code_match filter those out, same as before.
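
For illustration only, packing both numbers into a single payload could look something like this (a hypothetical encoding, not the actual quickjs-ng one):

```c
/* Hypothetical encoding, not the actual quickjs-ng one: pack a line and a
 * (clamped) column into a single 32-bit payload carried by one debug opcode. */
#include <stdint.h>
#include <stdio.h>

#define COLUMN_BITS 10
#define COLUMN_MAX  ((1u << COLUMN_BITS) - 1)

static uint32_t pack_loc(uint32_t line, uint32_t column)
{
    if (column > COLUMN_MAX)
        column = COLUMN_MAX;            /* clamp very long lines */
    return (line << COLUMN_BITS) | column;
}

static void unpack_loc(uint32_t loc, uint32_t *line, uint32_t *column)
{
    *line = loc >> COLUMN_BITS;
    *column = loc & COLUMN_MAX;
}

int main(void)
{
    uint32_t line, col;
    unpack_loc(pack_loc(120, 37), &line, &col);
    printf("line %u column %u\n", (unsigned)line, (unsigned)col);
    return 0;
}
```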

Yes, we evaluate small pieces of JS code and parse them all the time.

Right, I can see how that would regress performance because the parser has to do more work now. I consider that a correctness fix though so don't expect it to be rolled back. However, maybe it's possible to make it faster.

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 10, 2025

I double-checked and yes, bace4f6 introduces the regression.

Can you please speed it up a bit?

@saghul
Contributor

saghul commented Feb 10, 2025

Aren't we talking ~1% perf difference here?

@ABBAPOH
Contributor Author

ABBAPOH commented Feb 10, 2025

9761057471 / 9396116795 = 1.0388, so closer to 4%.

But yeah, having columns is a nice feature, and if it is hard or impossible to speed up, we will have to live with that slowdown.
