Static Foreign Calls #5495

ChrisPenner · 2024-12-10T01:34:56Z

Overview

Foreign calls were being dispatched dynamically. This PR puts all foreign calls in a big ol' case statement and inlines all the ForeignConvention typeclass instances.

Interestingly, the first iteration of this PR was slower than trunk, but explicitly NOINLINE'ing the foreign calls (with a wrapper to ensure Stack gets unboxed) sped it up significantly. This implies that the code size of eval was significantly affecting code caching, which isn't too surprising considering these are all microbenchmarks. So we should be careful about how much code we have in our main loop; maybe we should even consider splitting off some of the lesser-used instructions into their own chunk of code.

It's actually quite surprising that this change now speeds up the regular suite, despite it not using foreign calls for most of its benches.
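A rough sketch of the NOINLINE-plus-wrapper shape described above, not the code from this PR. The ForeignFunc constructors, the two-field Stack, and the foreignCall/foreignCallWorker names are all hypothetical stand-ins. The idea is that a tiny wrapper is inlined into the interpreter loop and takes the Stack apart there, while the big case statement stays out of line:

```haskell
-- Hypothetical enumeration; the real set of foreign calls is far larger.
data ForeignFunc = AddTop | DupTop | DropTop
  deriving (Eq, Show, Enum, Bounded)

-- Toy stand-in for the runtime stack: a depth counter plus the top value.
data Stack = Stack !Int !Int
  deriving (Eq, Show)

-- Thin wrapper, cheap to inline into the interpreter loop. It takes the
-- Stack apart at the call site so the out-of-line worker receives fields.
foreignCall :: ForeignFunc -> Stack -> Stack
foreignCall f (Stack depth top) = foreignCallWorker f depth top
{-# INLINE foreignCall #-}

-- All foreign call bodies live in one static case statement. NOINLINE keeps
-- this code out of the caller, so the main loop itself stays small.
foreignCallWorker :: ForeignFunc -> Int -> Int -> Stack
foreignCallWorker f depth top = case f of
  AddTop -> Stack depth (top + 1)
  DupTop -> Stack (depth + 1) top
  DropTop -> Stack (max 0 (depth - 1)) top
{-# NOINLINE foreignCallWorker #-}
```

Whether this split helps in practice depends on the real code size and has to be measured, as the benchmarks in this PR were.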

Implementation notes

Make a big ol' case statement for every foreign call
Use that instead of FF records
Adds a new Sandbox Failure instruction
Move sandboxing for foreigns to a pre-processing step over all source code which removes foreign calls and replaces them with an instruction that just errors.
Fix sandboxing of Ref instructions which were added previously
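The sandboxing pre-processing step in the notes above can be pictured roughly as follows. This is a toy model under stated assumptions: the Instr type, the two ForeignFunc constructors, and the sanitize, isSandboxed and run functions are hypothetical and much simpler than the real MCode.

```haskell
{-# LANGUAGE LambdaCase #-}

-- Hypothetical foreign calls: one performs IO, one is pure.
data ForeignFunc = FileRead | NatIncrement
  deriving (Eq, Show)

-- Toy instruction set; SandboxFailure carries the name of the blocked call.
data Instr
  = Push Int
  | Foreign ForeignFunc
  | SandboxFailure String
  deriving (Eq, Show)

-- Assumption: whether a call is forbidden is a static property of the call.
isSandboxed :: ForeignFunc -> Bool
isSandboxed = \case
  FileRead -> True
  NatIncrement -> False

-- Pre-processing pass: runs once over the code, before evaluation, and
-- swaps each forbidden foreign call for an instruction that just errors.
sanitize :: [Instr] -> [Instr]
sanitize = map $ \case
  Foreign f | isSandboxed f -> SandboxFailure (show f)
  other -> other

-- Evaluator with no per-call sandbox check; it only fails if execution
-- actually reaches a SandboxFailure instruction.
run :: [Instr] -> [Int] -> Either String [Int]
run [] stk = Right stk
run (instr : rest) stk = case instr of
  Push n -> run rest (n : stk)
  Foreign NatIncrement -> case stk of
    x : xs -> run rest (x + 1 : xs)
    [] -> Left "NatIncrement: empty stack"
  Foreign FileRead -> run rest (0 : stk) -- placeholder for a real effect
  SandboxFailure name -> Left ("attempted to use sandboxed operation: " ++ name)
```

In this model, sandboxed code is run as run (sanitize code) [] and trusted code as run code [], so the evaluator itself never consults a sandbox flag for foreign calls.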

Test coverage

Tested that the new foreign call sandboxing works as expected.

Benchmarks

foreigns : '{IO, Exception} ()
foreigns = do
  printTime
    "time-foreigns" 10 (n ->
      repeat n do
        repeat 1000 do
          use Either toBug
          t = toBug <| monotonic.impl()
          _ = sec t
          _ = nsec t
          -- repeat ~100 times
  tvar = TVar.newIO 0
  printTime
    "tvar foreigns" 10 (n ->
      repeat n do
        repeat 1000 do
          atomically do
            _ = read tvar
            _ = write tvar 1
            _ = swap tvar 10
            -- repeat ~100 times

trunk -> new

Time foreigns
25.45208ms -> 23.89372ms

tvar foreigns
42.9391ms -> 36.974ms

Normal suite: I wouldn't have expected these to change much, since they don't use foreigns very much, but it looks like things still somehow improved a bit 🎉

fib1
325.198µs -> 314.807µs

fib2
2.293117ms -> 2.259432ms

fib3
2.671696ms -> 2.646835ms

Decode Nat
345ns -> 337ns

Generate 100 random numbers
210.18µs -> 207.108µs

List.foldLeft
2.141224ms -> 2.025939ms

Count to 1 million
128.4097ms -> 124.7779ms

Json parsing (per document)
268.28µs -> 261.794µs

Count to N (per element)
191ns -> 188ns

Count to 1000
192.286µs -> 188.466µs

Mutate a Ref 1000 times
318.636µs -> 314.031µs

CAS an IO.ref 1000 times
427.475µs -> 425.158µs

List.range (per element)
332ns -> 325ns

List.range 0 1000
349.66µs -> 345.243µs

Set.fromList (range 0 1000)
1.681155ms -> 1.592714ms

Map.fromList (range 0 1000)
1.242585ms -> 1.173887ms

NatMap.fromList (range 0 1000)
4.986759ms -> 4.871296ms

Map.lookup (1k element map)
2.617µs -> 2.546µs

Map.insert (1k element map)
7.114µs -> 6.888µs

List.at (1k element list)
285ns -> 277ns

Text.split /
32.893µs -> 32.688µs

Loose ends

It's probably worth investigating the overhead of the sandbox checks in the interpreter; we may be able to remove them for instructions as well.

ChrisPenner · 2024-12-10T18:06:28Z

unison-runtime/src/Unison/Runtime/Builtin.hs

All the builtin implementations moved to Unison.Runtime.Foreign.Function; the builtin names are in Unison.Runtime.Foreign.Function.Type

ChrisPenner · 2024-12-10T18:08:01Z

unison-runtime/src/Unison/Runtime/Foreign/Function.hs

This module has been replaced wholesale with the big ol' case statement of builtin implementations

ChrisPenner · 2024-12-10T18:15:15Z

unison-runtime/src/Unison/Runtime/MCode.hs

@@ -278,6 +285,7 @@ argsToLists = \case
  VArgR i l -> take l [i ..]
  VArgN us -> primArrayToList us
  VArgV _ -> internalBug "argsToLists: DArgV"
+{-# INLINEABLE argsToLists #-}


Since argsToLists is in a different module from where it's used, you need to explicitly tell GHC to expose its implementation so it can be inlined, which helps to fuse away the lists.
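As a small illustration of the pragma's placement, here is a simplified sketch. The Args type and sumArgs consumer are hypothetical, and everything sits in one file for brevity, whereas the case that matters is a consumer in another module. INLINABLE asks GHC to record the definition's unfolding in the interface file regardless of its size, which gives the optimizer a chance to inline it at the use site and fuse the intermediate list with its consumer:

```haskell
-- Hypothetical argument descriptor, loosely modelled on the diff above.
data Args
  = ArgRange Int Int -- start index and length
  | ArgList [Int]

argsToLists :: Args -> [Int]
argsToLists args = case args of
  ArgRange i l -> take l [i ..]
  ArgList xs -> xs
{-# INLINABLE argsToLists #-}

-- Imagine this consumer living in a different module. With the unfolding
-- available, GHC has the opportunity to avoid building the list at all.
sumArgs :: Args -> Int
sumArgs = foldr (+) 0 . argsToLists
```

Whether fusion actually fires is best confirmed by inspecting the generated Core; the pragma only makes the unfolding available.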

ChrisPenner · 2024-12-10T18:43:53Z

unison-runtime/src/Unison/Runtime/Machine.hs

-  pure stk
-bprim1 !stk TIKR i = do
+bprim1 !env !stk RRFC i
+  | sandboxed env = die "attempted to use sandboxed operation: Ref.readForCAS"


I missed this sandboxing check as part of the previous instructions PR

dolio

Looks mostly good. I left a couple notes, though.

I'm not really a fan of just sprinkling around 'optimization' stuff that probably or definitely doesn't do anything. It makes it harder for people to know/learn what annotations and such actually matter. And often you do need to actually know and investigate whether they actually matter to get good results, because just following a simple script like 'bang patterns on every argument' does not always give good results, like we ended up seeing with the enum map change.

dolio · 2024-12-13T16:29:24Z

unison-runtime/src/Unison/Runtime/Foreign/Function.hs


 instance (ForeignConvention a) => ForeignConvention (Maybe a) where
-  readForeign (i : args) stk =
+  readForeign !(i : args) !stk =


The bang pattern in !(i : args) doesn't do anything. Pattern matching on a data type is already strict.

ChrisPenner · 2024-12-13

Agreed, I removed all the bangs on obvious pattern matching. I also tried removing all the bangs to see what GHC would do and it slowed things down noticeably, so I left the other bangs in place 👍🏼
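The point about redundant bangs can be checked with a small experiment. The functions below are illustrative only and are not taken from the PR. A bang on a constructor pattern changes nothing, because matching the constructor already forces the argument, whereas a bang on a variable or wildcard pattern does change strictness:

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Exception (SomeException, evaluate, try)

-- Matching on (i : _) already forces the list argument.
headPlain :: [Int] -> Int -> Int
headPlain (i : _) _ = i
headPlain [] d = d

-- Same function with a bang on the constructor pattern: no extra effect.
headBang :: [Int] -> Int -> Int
headBang !(i : _) _ = i
headBang [] d = d

-- The second argument is never inspected, so it is never forced.
keepFirstLazy :: Int -> Int -> Int
keepFirstLazy x _ = x

-- Here the bang matters: the second argument is forced before returning.
keepFirstStrict :: Int -> Int -> Int
keepFirstStrict x !_ = x

-- True when forcing the value to weak head normal form throws.
throwsWhenForced :: a -> IO Bool
throwsWhenForced x = do
  r <- try (evaluate (x `seq` ())) :: IO (Either SomeException ())
  pure (either (const True) (const False) r)
```

Passing undefined as the list argument makes both headPlain and headBang throw when forced, while passing undefined as the second argument makes only keepFirstStrict throw.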

dolio · 2024-12-13T16:30:33Z

unison-runtime/src/Unison/Runtime/Foreign/Function.hs


 instance ForeignConvention Text where
  readForeign = readForeignBuiltin
+  {-# INLINE readForeign #-}


All these inlines for trivial x = y definitions likely don't do anything but add line count.

ChrisPenner · 2024-12-13

I think we definitely do want GHC to inline this, since obviously it's small, and will mean GHC specializes it rather than leaving things polymorphic.

So I'm assuming you're just saying it's redundant because GHC will almost certainly inline it, but in that case, do you have a specific rule for how complex definitions should be before I add an explicit inline?

I can just remove it from all the x = y cases if you like, but personally I'd prefer in this case to leave the inlines on all of them since:

We do want them to be inlined

It's likely to reduce the chances that future ForeignConvention instances accidentally miss an Inline

If it ever gets moved to a different module GHC probably won't inline it without the pragma.

I tried removing all INLINE pragmas and the core definitely changes, but the timings still seem good, so we could also just do that; just need to be careful not to split the typeclass away from the module :)

Just let me know what you'd prefer and I'll try that and benchmark it 👍🏼 😄

dolio · 2024-12-13

I'm not saying we don't want them to be inlined. I'm saying INLINE pragmas on x = y definitions are just telling GHC to do things it will decide to do anyway. It's not true that it only does cross-module inlining if you tell it to. And as far as I can tell, type classes don't introduce an obstacle, either.

I'm also not saying that it's not better to err a little on being more explicit. However, the heuristic for cross-module inlining is, "small," and x = y is as small as it gets. So, like, keep in mind what GHC does do (and I checked in this case).

dolio · 2024-12-13T17:29:33Z

BTW, I didn't meticulously read the big case statement for foreign functions. But I assume it was just cut and paste from stuff that already existed, so probably doesn't require a ton of attention.

The pruning was causing problems with compiled programs when inlining was on, because it would prune based on the inlined code. The inlined code may have certain intermediate combinators omitted, but those are still necessary to have a full picture of the source code. Since `compile` was using the MCode numbering and backing out which References are necessary from that, it would throw away the source code for these intermediate definitions. This then caused problems when e.g. cloud (running from a compiled build) would try to send code to other environments. It wouldn't have the intermediate terms necessary for the remote environment to do its own intermediate->interpreter step. This new approach does all the 'necessary terms' tracing at the intermediate level, and then instead determines which MCode level definitions are necessary from that. This means that the pruning is no longer sensitive to the inlining. So, it should be safe to turn inlining back on.

Do reference-based pruning for ucm compile, turn back on inlining
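The pruning strategy described above can be sketched as a reachability walk over intermediate-level dependencies, followed by a filter on the MCode-level definitions. This is only an illustration under simplifying assumptions: references are plain strings, the dependency table is an association list, and the reachable and pruneMCode names are hypothetical.

```haskell
import Data.Maybe (fromMaybe)

type Ref = String

-- Hypothetical intermediate-level dependency table: each reference maps to
-- the references its source definition mentions, before any inlining.
type Deps = [(Ref, [Ref])]

-- Trace every reference reachable from the roots, in order of discovery.
reachable :: Deps -> [Ref] -> [Ref]
reachable deps = go []
  where
    go seen [] = reverse seen
    go seen (r : rs)
      | r `elem` seen = go seen rs
      | otherwise = go (r : seen) (fromMaybe [] (lookup r deps) ++ rs)

-- Keep an MCode-level definition only if its reference was reached at the
-- intermediate level, so inlining in the MCode cannot hide a dependency.
pruneMCode :: [Ref] -> [(Ref, code)] -> [(Ref, code)]
pruneMCode keep = filter (\(r, _) -> r `elem` keep)
```

For example, if main depends on helper at the intermediate level but helper has been inlined away in main's MCode, tracing from main over the intermediate table still retains helper, which is the property the paragraph above relies on.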

ChrisPenner added 9 commits December 4, 2024 14:32

Define enum for all foreign calls

91cb40a

Implement a bunch of builtin impls

1444456

Finish porting over foreign calls

cae64d7

WIP on switching from numbered foreign funcs

12dbac8

Remove all the old sandboxing

d53b00b

Sandbox foreigns with a preprocessing step.

d3c9c69

Merge trunk back into inline-foreign-calls

6b63eb1

Get Stack unboxing more reliably

dfac404

Inline argsToLists

5e2b968

ChrisPenner changed the base branch from trunk to wip/remove-cycle-length-draft December 10, 2024 17:37

ChrisPenner changed the base branch from wip/remove-cycle-length-draft to trunk December 10, 2024 17:38

Fix MCode Serialization tests

adc5f20

Replace unused Foreign Function module with Impl

2dddbdf

ChrisPenner added 2 commits December 10, 2024 10:33

Cleanup and docs

316e452

Rename sanitization

af717a7

ChrisPenner marked this pull request as ready for review December 10, 2024 18:44

ChrisPenner requested a review from dolio December 10, 2024 18:45

dolio requested changes Dec 13, 2024

ChrisPenner added 2 commits December 13, 2024 11:58

Remove obviously redundant bang patterns

6779235

Remove all INLINEs on ForeignConvention

33db037

ChrisPenner force-pushed the cp/inlined-foreign-calls branch from 4e23143 to 33db037 Compare December 13, 2024 19:59

dolio approved these changes Dec 16, 2024

aryairani and others added 2 commits December 16, 2024 12:12

Merge pull request #5507 from unisonweb/fix/interp-inlining

28973da

Do reference-based pruning for ucm compile, turn back on inlining

Re-merge trunk

968be8b

ChrisPenner merged commit 94f7c1b into trunk Dec 16, 2024
32 checks passed

ChrisPenner deleted the cp/inlined-foreign-calls branch December 16, 2024 19:53
