Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow opcode analysis #750

Open
bnoordhuis opened this issue Dec 11, 2024 · 4 comments
Open

Slow opcode analysis #750

bnoordhuis opened this issue Dec 11, 2024 · 4 comments

Comments

@bnoordhuis
Copy link
Contributor

I added per-opcode instrumentation, ran web-tooling-benchmark, and loaded the results in sqlite. Results are interesting.

c = call count
t = total time (in ns)
s = opcode name
r = t / c, avg. opcode execution time (in ns)

sqlite> select *,t/c as r from x order by r desc limit 40;
┌───────────┬───────────────┬───────────────────┬────────┐
│     c     │       t       │         s         │   r    │
├───────────┼───────────────┼───────────────────┼────────┤
│ 297046    │ 160797162930  │ call              │ 541320 │
│ 2325303   │ 429875894509  │ call0             │ 184868 │
│ 14743203  │ 965889142207  │ call2             │ 65514  │
│ 125261899 │ 7370985213642 │ call_method       │ 58844  │
│ 13544997  │ 747222340506  │ call1             │ 55165  │
│ 3377858   │ 149952638311  │ call3             │ 44392  │
│ 1300      │ 27999013      │ apply             │ 21537  │
│ 811163    │ 10971826436   │ for_of_next       │ 13526  │
│ 206593    │ 1219992217    │ for_in_start      │ 5905   │
│ 6318101   │ 37247321981   │ call_constructor  │ 5895   │
│ 17305     │ 79997148      │ fclosure          │ 4622   │
│ 1516      │ 3999859       │ rest              │ 2638   │
│ 10527022  │ 17135605944   │ get_field         │ 1627   │
│ 1412056   │ 1495982423    │ special_object    │ 1059   │
│ 4941144   │ 3055962646    │ array_from        │ 618    │
│ 2902204   │ 1683940059    │ fclosure8         │ 580    │
│ 2351256   │ 1031998868    │ put_var_ref       │ 438    │
│ 3251755   │ 1079997164    │ push_empty_string │ 332    │
│ 3363637   │ 1095996590    │ typeof            │ 325    │
│ 23366458  │ 6731903610    │ put_field         │ 288    │
│ 4744089   │ 1351987428    │ instanceof        │ 284    │
│ 25134101  │ 4935931607    │ add               │ 196    │
│ 448706    │ 79997169      │ put_var_ref3      │ 178    │
│ 13377786  │ 2323953142    │ define_field      │ 173    │
│ 5754872   │ 987965095     │ object            │ 171    │
│ 241515    │ 39998592      │ for_of_start      │ 165    │
│ 27547     │ 3999859       │ div               │ 145    │
│ 753265    │ 107996166     │ regexp            │ 143    │
│ 60393     │ 7999713       │ insert3           │ 132    │
│ 938258    │ 99996458      │ put_var_ref0      │ 106    │
│ 483138    │ 47998310      │ make_loc_ref      │ 99     │
│ 29612509  │ 2779972413    │ get_field2_ic     │ 93     │
│ 123283760 │ 10615837321   │ get_field2        │ 86     │
│ 102269309 │ 8403880188    │ get_array_el      │ 82     │
│ 47267797  │ 3539945727    │ put_field_ic      │ 74     │
│ 277351    │ 19999296      │ delete            │ 72     │
│ 9346829   │ 659976683     │ put_array_el      │ 70     │
│ 483162    │ 31998871      │ put_ref_value     │ 66     │
│ 835601    │ 51998156      │ in                │ 62     │
│ 271774    │ 15999430      │ set_name          │ 58     │
└───────────┴───────────────┴───────────────────┴────────┘

The first entries are call instructions but that's because I can't split between self time and cumulative time; time spent in the function call is attributed to the call opcode.

Observations:

  • for_of_next and for_in_start are surprisingly slow; maybe because of property lookups?

  • fclosure is reeeally slow (4.6 microseconds!); fclosure8 is better bit still sluggish. Could be a benchmarking artifact; the suite is basically a bunch of really big functions

sqlite> select *,t/c as r from x where s like '%fclos%';
┌─────────┬────────────┬───────────┬──────┐
│    c    │     t      │     s     │  r   │
├─────────┼────────────┼───────────┼──────┤
│ 2902204 │ 1683940059 │ fclosure8 │ 580  │
│ 17305   │ 79997148   │ fclosure  │ 4622 │
└─────────┴────────────┴───────────┴──────┘
  • get_field is horrendously slow. Why? Its brethren ops are plenty fast:
sqlite> select *,t/c as r from x where s like '%get_field%';
┌───────────┬─────────────┬───────────────┬──────┐
│     c     │      t      │       s       │  r   │
├───────────┼─────────────┼───────────────┼──────┤
│ 338775163 │ 17707657797 │ get_field_ic  │ 52   │
│ 123283760 │ 10615837321 │ get_field2    │ 86   │
│ 29612509  │ 2779972413  │ get_field2_ic │ 93   │
│ 10527022  │ 17135605944 │ get_field     │ 1627 │
└───────────┴─────────────┴───────────────┴──────┘
  • push_empty_string - an optimization! - is pretty slow at .3 usec. Sound of sad violin playing.
  • typeof is slower than I expected it to be; maybe needs more special casing because the specialized versions are fast
sqlite> select *,t/c as r from x where s like '%typeof%';
┌─────────┬────────────┬─────────────────────┬─────┐
│    c    │     t      │          s          │  r  │
├─────────┼────────────┼─────────────────────┼─────┤
│ 3363637 │ 1095996590 │ typeof              │ 325 │
│ 832065  │ 15999439   │ typeof_is_undefined │ 19  │
│ 321758  │ 11999574   │ typeof_is_function  │ 37  │
└─────────┴────────────┴─────────────────────┴─────┘
@bnoordhuis
Copy link
Contributor Author

I was a little distrustful of the granularity and accuracy of CLOCK_MONOTONIC so I reran the benchmarks with RDTSC and qjs pinned to a single core and a few things changed a lot.

Here is that typeof table and keep in mind t is now in CPU cycles, not nanoseconds (1 cycle ~= 1/3 ns):

sqlite> select *,t/c as r from x where s like '%typeof%';
┌─────────┬──────────┬─────────────────────┬────┐
│    c    │    t     │          s          │ r  │
├─────────┼──────────┼─────────────────────┼────┤
│ 3363637 │ 81868001 │ typeof              │ 24 │
│ 832065  │ 21032232 │ typeof_is_undefined │ 25 │
│ 321758  │ 8738744  │ typeof_is_function  │ 27 │
└─────────┴──────────┴─────────────────────┴────┘

get_field is still slow:

sqlite> select *,t/c as r from x where s like '%get_field%';
┌───────────┬─────────────┬───────────────┬──────┐
│     c     │      t      │       s       │  r   │
├───────────┼─────────────┼───────────────┼──────┤
│ 10527022  │ 21325986296 │ get_field     │ 2025 │
│ 123294376 │ 5484148340  │ get_field2    │ 44   │
│ 338767481 │ 9652498878  │ get_field_ic  │ 28   │
│ 29612509  │ 668099190   │ get_field2_ic │ 22   │
└───────────┴─────────────┴───────────────┴──────┘

Top 30 with min. 1,000 calls cutoff. put_empty_string disappeared, append appeared, otherwise not very different:

qlite> select *,t/c as r from x where c > 1e3 order by r desc limit 30;
┌───────────┬───────────────┬──────────────────┬────────┐
│     c     │       t       │        s         │   r    │
├───────────┼───────────────┼──────────────────┼────────┤
│ 297138    │ 142073507487  │ call             │ 478139 │
│ 2325303   │ 623059986346  │ call0            │ 267947 │
│ 14733930  │ 1390064423928 │ call2            │ 94344  │
│ 13544997  │ 908799057192  │ call1            │ 67094  │
│ 125272469 │ 7497077076400 │ call_method      │ 59846  │
│ 3359137   │ 171495062049  │ call3            │ 51053  │
│ 1300      │ 25226363      │ apply            │ 19404  │
│ 17305     │ 190480511     │ fclosure         │ 11007  │
│ 811163    │ 8699719011    │ for_of_next      │ 10724  │
│ 6318101   │ 35312541613   │ call_constructor │ 5589   │
│ 206593    │ 468269884     │ for_in_start     │ 2266   │
│ 10527022  │ 21325986296   │ get_field        │ 2025   │
│ 2227      │ 3194527       │ append           │ 1434   │
│ 2902204   │ 3787403331    │ fclosure8        │ 1305   │
│ 1412056   │ 1013840174    │ special_object   │ 717    │
│ 2322      │ 1506405       │ define_method    │ 648    │
│ 1516      │ 739721        │ rest             │ 487    │
│ 4941144   │ 2354852821    │ array_from       │ 476    │
│ 5754872   │ 1998744116    │ object           │ 347    │
│ 241515    │ 79257027      │ for_of_start     │ 328    │
│ 753265    │ 223445041     │ regexp           │ 296    │
│ 483162    │ 128700308     │ put_ref_value    │ 266    │
│ 938258    │ 233680349     │ put_var_ref0     │ 249    │
│ 483138    │ 118385578     │ make_loc_ref     │ 245    │
│ 448706    │ 101718593     │ put_var_ref3     │ 226    │
│ 23366458  │ 5105685616    │ put_field        │ 218    │
│ 13377786  │ 2827971680    │ define_field     │ 211    │
│ 277351    │ 49309130      │ delete           │ 177    │
│ 25134285  │ 4364328704    │ add              │ 173    │
│ 4744089   │ 819711980     │ instanceof       │ 172    │
└───────────┴───────────────┴──────────────────┴────────┘

I'd say this gives some good leads where to start optimizing, like how append is currently implemented in terms of iterators but can likely very fruitfully short-circuit plain arrays.

@saghul
Copy link
Contributor

saghul commented Dec 12, 2024

This is awesome!

@richarddd
Copy link
Contributor

Wow! Fantastic job! Maybe you can gate this behind a flag, capture the output of this in CI and compare with the change so we notice performance regressions/improvments?

@bnoordhuis
Copy link
Contributor Author

Yep, I plan to make it less ad hoc eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants