Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Static Foreign Calls #5495

Merged
merged 18 commits into from
Dec 16, 2024
Merged

Static Foreign Calls #5495

merged 18 commits into from
Dec 16, 2024

Conversation

ChrisPenner
Copy link
Contributor

@ChrisPenner ChrisPenner commented Dec 10, 2024

Overview

Foreign Calls were being dispatched dynamically. This puts all foreign calls in a big'ol case statement and inlines all the ForeignConvention typeclasses.

Interestingly, The first iteration of this PR was slower than trunk, but explicitly NOINLINE'ing the foreign calls (with a wrapper to ensure Stack gets unboxed) sped it up significantly. This implies that code size of eval was affecting things code-caching significantly, which isn't too surprising considering these are all microbenchmarks. So we should be careful about how much code we have in our main loop; maybe even should consider splitting off some of the lesser used instructions into their own chunk of code.

It's actually quite surprising that this change now speeds up the regular suite, despite it not using foreign calls for most of its benches.

Implementation notes

  • Make a big ol' case statement for every foreign call
  • Use that instead of FF records
  • Adds a new Sandbox Failure instruction
  • Move sandboxing for foreigns to a pre-processing step over all source code which removes foreign calls and replaces them with an instruction that just errors.
  • Fix sandboxing of Ref instructions which were added previously

Test coverage

  • Tested that the new foreign call sandboxing works as expected.

Benchmarks

foreigns : '{IO, Exception} ()
foreigns = do
  printTime
    "time-foreigns" 10 (n ->
      repeat n do
        repeat 1000 do
          use Either toBug
          t = toBug <| monotonic.impl()
          _ = sec t
          _ = nsec t
          -- repeat ~100 times
  tvar = TVar.newIO 0
  printTime
    "tvar foreigns" 10 (n ->
      repeat n do
        repeat 1000 do
          atomically do
            _ = read tvar
            _ = write tvar 1
            _ = swap tvar 10
            ~repeat ~100 times

trunk -> new

Time foreigns
25.45208ms -> 23.89372ms

tvar foreigns
42.9391ms -> 36.974ms

Normal suite, I wouldn't have expected these to change much since they don't use foreigns very much, but it looks like things still somehow improved a bit 🎉

fib1
325.198µs -> 314.807µs

fib2
2.293117ms -> 2.259432ms

fib3
2.671696ms -> 2.646835ms

Decode Nat
345ns -> 337ns

Generate 100 random numbers
210.18µs -> 207.108µs

List.foldLeft
2.141224ms -> 2.025939ms

Count to 1 million
128.4097ms -> 124.7779ms

Json parsing (per document)
268.28µs -> 261.794µs

Count to N (per element)
191ns -> 188ns

Count to 1000
192.286µs -> 188.466µs

Mutate a Ref 1000 times
318.636µs -> 314.031µs

CAS an IO.ref 1000 times
427.475µs -> 425.158µs

List.range (per element)
332ns -> 325ns

List.range 0 1000
349.66µs -> 345.243µs

Set.fromList (range 0 1000)
1.681155ms -> 1.592714ms

Map.fromList (range 0 1000)
1.242585ms -> 1.173887ms

NatMap.fromList (range 0 1000)
4.986759ms -> 4.871296ms

Map.lookup (1k element map)
2.617µs -> 2.546µs

Map.insert (1k element map)
7.114µs -> 6.888µs

List.at (1k element list)
285ns -> 277ns

Text.split /
32.893µs -> 32.688µs

Loose ends

Probably worth investigating the overhead of the sandbox checks in the interpreter, we may be able to remove them for instructions as well.

@ChrisPenner ChrisPenner changed the base branch from trunk to wip/remove-cycle-length-draft December 10, 2024 17:37
@ChrisPenner ChrisPenner changed the base branch from wip/remove-cycle-length-draft to trunk December 10, 2024 17:38
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the builtin implementations moved to Unison.Runtime.Foreign.Function; the builtin names are in Unison.Runtime.Foreign.Function.Type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module has been replaced wholesale with the big ol' case statement of builtin implementations

@@ -278,6 +285,7 @@ argsToLists = \case
VArgR i l -> take l [i ..]
VArgN us -> primArrayToList us
VArgV _ -> internalBug "argsToLists: DArgV"
{-# INLINEABLE argsToLists #-}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since argsToLists is in a different module from where it's used, you need to explicitly tell GHC to expose it's implementation so it can be inlined, which helps to fuse away the lists.

pure stk
bprim1 !stk TIKR i = do
bprim1 !env !stk RRFC i
| sandboxed env = die "attempted to use sandboxed operation: Ref.readForCAS"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this sandboxing check as part of the previous instructions PR

@ChrisPenner ChrisPenner marked this pull request as ready for review December 10, 2024 18:44
@ChrisPenner ChrisPenner requested a review from dolio December 10, 2024 18:45
Copy link
Contributor

@dolio dolio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good. I left a couple notes, though.

I'm not really a fan of just sprinkling around 'optimization' stuff that probably or definitely doesn't do anything. It makes it harder for people to know/learn what annotations and such actually matter. And often you do need to actually know and investigate whether they actually matter to get good results, because just following a simple script like 'bang patterns on every argument' does not always give good results, like we ended up seeing with the enum map change.


instance (ForeignConvention a) => ForeignConvention (Maybe a) where
readForeign (i : args) stk =
readForeign !(i : args) !stk =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bang pattern in !(i : args) doesn't do anything. Pattern matching on a data type is already strict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I removed all the bangs on obvious pattern matching. I also tried removing all the bangs to see what GHC would do and it slowed things down noticeably, so I left the other bangs in place 👍🏼


instance ForeignConvention Text where
readForeign = readForeignBuiltin
{-# INLINE readForeign #-}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these inlines for trivial x = y definitions likely don't do anything but add line count.

Copy link
Contributor Author

@ChrisPenner ChrisPenner Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we definitely do want GHC to inline this, since obviously it's small, and will mean GHC specializes it rather than leaving things polymorphic.

So I'm assuming you're just saying it's redundant because GHC will almost certainly inline it, but in that case, do you have a specific rule for how complex definitions should be before I add an explicit inline?

I can just remove it from all the x = y cases if you like, but personally I'd prefer in this case to leave the inlines on all of them since:

  1. We do want them to be inlined
  2. It's likely to reduce the chances that future ForeignConvention instances accidentally miss an Inline
  3. If it ever gets moved to a different module GHC probably won't inline it without the pragma.

I tried removing all INLINE pragmas and the core definitely changes, but the timings still seem good, so we could also just do that; just need to be careful not to split the typeclass away from the module :)

Just let me know what you'd prefer and I'll try that and benchmark it 👍🏼 😄

Copy link
Contributor

@dolio dolio Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying we don't want them to be inlined. I'm saying INLINE pragmas on x = y definitions are just telling GHC to do things it will decide to do anyway. It's not true that it only does cross-module inlining if you tell it to. And as far as I can tell, type classes don't introduce an obstacle, either.

I'm also not saying that it's not better to err a little on being more explicit. However, the heuristic for cross-module inlining is, "small," and x = y is as small as it gets. So, like, keep in mind what GHC does do (and I checked in this case).

@dolio
Copy link
Contributor

dolio commented Dec 13, 2024

BTW, I didn't meticulously read the big case statement for foreign functions. But I assume it was just cut and paste from stuff that already existed, so probably doesn't require a ton of attention.

@ChrisPenner ChrisPenner force-pushed the cp/inlined-foreign-calls branch from 4e23143 to 33db037 Compare December 13, 2024 19:59
The pruning was causing problems with compiled programs when
inlining was on, because it would prune based on the inlined
code. The inlined code may have certain intermediate combinators
omitted, but those are still necessary to have a full picture of
the source code. Since `compile` was using the MCode numbering
and backing out which References are necessary from that, it
would throw away the source code for these intermediate
definitions. This then caused problems when e.g. cloud (running
from a compiled build) would try to send code to other
environments. It wouldn't have the intermediate terms necessary
for the remote environment to do its own
intermediate->interpreter step.

This new approach does all the 'necessary terms' tracing at the
intermediate level, and then instead determines which MCode level
defintions are necessary from that. This means that the pruning
is no longer sensitive to the inlining. So, it should be safe to
turn inlining back on.
aryairani and others added 2 commits December 16, 2024 12:12
Do reference-based pruning for ucm compile, turn back on inlining
@ChrisPenner ChrisPenner merged commit 94f7c1b into trunk Dec 16, 2024
32 checks passed
@ChrisPenner ChrisPenner deleted the cp/inlined-foreign-calls branch December 16, 2024 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants