-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Static Foreign Calls #5495
Static Foreign Calls #5495
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the builtin implementations moved to Unison.Runtime.Foreign.Function; the builtin names are in Unison.Runtime.Foreign.Function.Type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This module has been replaced wholesale with the big ol' case statement of builtin implementations
@@ -278,6 +285,7 @@ argsToLists = \case | |||
VArgR i l -> take l [i ..] | |||
VArgN us -> primArrayToList us | |||
VArgV _ -> internalBug "argsToLists: DArgV" | |||
{-# INLINEABLE argsToLists #-} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since argsToLists
is in a different module from where it's used, you need to explicitly tell GHC to expose it's implementation so it can be inlined, which helps to fuse away the lists.
pure stk | ||
bprim1 !stk TIKR i = do | ||
bprim1 !env !stk RRFC i | ||
| sandboxed env = die "attempted to use sandboxed operation: Ref.readForCAS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this sandboxing check as part of the previous instructions PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good. I left a couple notes, though.
I'm not really a fan of just sprinkling around 'optimization' stuff that probably or definitely doesn't do anything. It makes it harder for people to know/learn what annotations and such actually matter. And often you do need to actually know and investigate whether they actually matter to get good results, because just following a simple script like 'bang patterns on every argument' does not always give good results, like we ended up seeing with the enum map change.
|
||
instance (ForeignConvention a) => ForeignConvention (Maybe a) where | ||
readForeign (i : args) stk = | ||
readForeign !(i : args) !stk = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bang pattern in !(i : args)
doesn't do anything. Pattern matching on a data type is already strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I removed all the bangs on obvious pattern matching. I also tried removing all the bangs to see what GHC would do and it slowed things down noticeably, so I left the other bangs in place 👍🏼
|
||
instance ForeignConvention Text where | ||
readForeign = readForeignBuiltin | ||
{-# INLINE readForeign #-} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these inlines for trivial x = y
definitions likely don't do anything but add line count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we definitely do want GHC to inline this, since obviously it's small, and will mean GHC specializes it rather than leaving things polymorphic.
So I'm assuming you're just saying it's redundant because GHC will almost certainly inline it, but in that case, do you have a specific rule for how complex definitions should be before I add an explicit inline?
I can just remove it from all the x = y
cases if you like, but personally I'd prefer in this case to leave the inlines on all of them since:
- We do want them to be inlined
- It's likely to reduce the chances that future ForeignConvention instances accidentally miss an Inline
- If it ever gets moved to a different module GHC probably won't inline it without the pragma.
I tried removing all INLINE pragmas and the core definitely changes, but the timings still seem good, so we could also just do that; just need to be careful not to split the typeclass away from the module :)
Just let me know what you'd prefer and I'll try that and benchmark it 👍🏼 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not saying we don't want them to be inlined. I'm saying INLINE
pragmas on x = y
definitions are just telling GHC to do things it will decide to do anyway. It's not true that it only does cross-module inlining if you tell it to. And as far as I can tell, type classes don't introduce an obstacle, either.
I'm also not saying that it's not better to err a little on being more explicit. However, the heuristic for cross-module inlining is, "small," and x = y
is as small as it gets. So, like, keep in mind what GHC does do (and I checked in this case).
BTW, I didn't meticulously read the big case statement for foreign functions. But I assume it was just cut and paste from stuff that already existed, so probably doesn't require a ton of attention. |
4e23143
to
33db037
Compare
The pruning was causing problems with compiled programs when inlining was on, because it would prune based on the inlined code. The inlined code may have certain intermediate combinators omitted, but those are still necessary to have a full picture of the source code. Since `compile` was using the MCode numbering and backing out which References are necessary from that, it would throw away the source code for these intermediate definitions. This then caused problems when e.g. cloud (running from a compiled build) would try to send code to other environments. It wouldn't have the intermediate terms necessary for the remote environment to do its own intermediate->interpreter step. This new approach does all the 'necessary terms' tracing at the intermediate level, and then instead determines which MCode level defintions are necessary from that. This means that the pruning is no longer sensitive to the inlining. So, it should be safe to turn inlining back on.
Do reference-based pruning for ucm compile, turn back on inlining
Overview
Foreign Calls were being dispatched dynamically. This puts all foreign calls in a big'ol case statement and inlines all the ForeignConvention typeclasses.
Interestingly, The first iteration of this PR was slower than trunk, but explicitly NOINLINE'ing the foreign calls (with a wrapper to ensure Stack gets unboxed) sped it up significantly. This implies that code size of
eval
was affecting things code-caching significantly, which isn't too surprising considering these are all microbenchmarks. So we should be careful about how much code we have in our main loop; maybe even should consider splitting off some of the lesser used instructions into their own chunk of code.It's actually quite surprising that this change now speeds up the regular suite, despite it not using foreign calls for most of its benches.
Implementation notes
Test coverage
Benchmarks
trunk -> new
Normal suite, I wouldn't have expected these to change much since they don't use foreigns very much, but it looks like things still somehow improved a bit 🎉
Loose ends
Probably worth investigating the overhead of the sandbox checks in the interpreter, we may be able to remove them for instructions as well.