-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix cancel
in exception handling code
#96
base: master
Are you sure you want to change the base?
Fix cancel
in exception handling code
#96
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is a bit contrived, and possibly is just an abuse of the API. I've recommended that we switch to catch
instead of withException
as I wasn't aware the semantics were different.
To motivate it a bit more: our original code looked like this:
action `wtihException` sendAlert
sendAlert = runDB . sendMessageToSlack MyAlertChannel
Meanwhile, we've got some instrumenting code in runDB
that adds to a timer when we're waiting on a connection.
runDB action = do
let
timerAction =
forever do
threadDelay 1_000
recordAnotherMillisecond
withAsync (timerAction `finally` flushMetrics) \a -> do
runSqlPool (liftIO (cancel a) >> action) =<< askSqlPool
But, since withAsync
can't actually be canceled, if we run any database action in a onException
or withException
handler, then we just... hang forever.
This can be fixed locally, using withAsyncWithUnmask (\unmask -> unmask $ timerThread 0) \a -> ...
This seems deeply unsatisfying to me. But hey it's Haskell exceptions 😂
unliftio/src/UnliftIO/Exception.hs
Outdated
@@ -422,7 +422,7 @@ withException thing after = withRunInIO $ \run -> EUnsafe.uninterruptibleMask $ | |||
case res1 of | |||
Left e1 -> do | |||
-- see explanation in bracket | |||
_ :: Either SomeException b <- EUnsafe.try $ run $ after e1 | |||
_ :: Either SomeException b <- EUnsafe.try $ restore $ run $ after e1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this line fixes the test. But it also removes the guarantee that action `onException` cleanup
will definitely complete cleanup
.
when (n < 10) $ do | ||
timerAction (n + 1) | ||
withAsync (timerAction 0) $ \a -> do | ||
cancel a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can be replaced with pure ()
. withAsync
is also unable to cancel
the thread.
withAsync (timerAction 0) $ \a -> do | ||
cancel a | ||
eresult <- | ||
race |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This race
call also is unable to cancel one of the threads - so, with the original code, this test hangs for 10 seconds waiting for timerAction
to complete before returning Right 10
.
withAsync a b = withRunInIO $ \run -> do | ||
maskingState <- E.getMaskingState | ||
case maskingState of | ||
E.MaskedUninterruptible -> | ||
A.withAsyncWithUnmask (\unmask -> unmask (run a)) (run . b) | ||
_ -> | ||
A.withAsync (run a) (run . b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an alternative fix to using restore
in the handler of withException
and bracket
.
The Async
API is totally dependent on being able to throw async exceptions to the other thread. I don't think Async
makes any sense if it cannot do that.
While this single fix works here, it won't work for eg race
or any of the other code, so we may want to patch those as well.
I'm almost curious if this is a thing that needs to get raised with upstream Async
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented here simonmar/async#67
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be very worried about two things here:
- A very subtle change to behavior that can take production code down
- Diverging from
async
's handling of these cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream hasn't responded, so, we may want to handle this here?
I have a hard time imagining any code that depends on withAsync
creating an unkillable zombie and being unable to exit.
This is a decently large footgun that is only really likely to happen if you are using unliftio
or safe-exceptions
. While I do think that async
should really be launching threads in at most a MaskedInterruptible
state, it's not super likely to be triggered - uninterruptibleMask
is pretty rare to see, and has lots of warning labels.
Since unliftio
uses it a bunch in cleanup functions, though, I think it does make sense to make the combination of UnliftIO.Exception
and UnliftIO.Async
not quite so dangerous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following why this situation is worse with unliftio
or safe-exceptions
, it seems like other code would have the same behavior.
I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.
One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.
It's unlikely because most people read the docs for uninterruptibleMask
and decide they probably don't need it. So the main way it becomes a part of people's programs is when safe-exceptions
or unliftio
style cleanups enter the picture and puts it in with bracket
and friends.
But, yeah, anyone that runs uninterruptibleMask_ $ async $ do ...
is going to create an unkillable Async
.
I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.
I'm coming around that it should probably be unmask
ed in the async
and withAsync
functions, and definitely unmasked in race
and race
-like functions. If the intent is "one of these threads should die," then unmasking should happen. But concurrently
poses a slightly different question: Simon Marlow points out that finally foo (a >> b)
and finally foo (concurrently a b
should have the same semantics, but if concurrently
unmasks, then the action is no longer protected. So the appropriate default for concurrently
may be different than for race
, since the expectations/assumptions are different.
Likewise, withAsync
seems to demand the unmasking much more - uninterruptibleMask_ $ withAsync foo bar
cannot terminate if foo
is a forever
loop - you're deadlocked. But uninterruptibleMask_ $ async foo
is less likely to deadlock, since it's entirely possible that you never cancel
it, and thus never have the problem.
I'm not really sure the best way to approach it. Maybe separate modules with different behavior is the right answer? A separate opt-in package?
My chaos brain says something like UnliftIO.Async
needs to export these functions with a MaskingBehavior
argument, forcing all users to upgrade, and maybe give a warning/deprecation message that points to alternative modules. But that's a big breaking change for probably little benefit for those that aren't already experiencing the pain.
One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?
Yeah, that's a good pont - unfortunately, we're missing a primop.
I have opened an upstream issue: ideally, we'd have a primop unsafeUnUninterruptibleMask#
that would downgrade a UninterruptibleMask
to a InterruptibleMask
. But we don't have that, so the alternative of asyncWithUnmask $ \unmask -> unmask $ mask_ action
actually is vulnerable to an async exception at the unmask
point, since unsafeUnmask
(the function provided as unmask
) triggers any queued async exceptions in the C--. In the case of async, this is really unlikely - I think you'd need to cancel =<< async action
to observe the async exception slipping between the unmask $ mask_
.
Switching from mask_
(which checks masking state; should be unnecessary after unmask
) to block
would also accomplish this.
Hate to be a squeaky wheel, but this has now bitten us 3 times, resulting in non-graceful SIGKILL shutdowns I've made the patch locally to ensure that all |
Fair enough, let's try to move forward on this then. Can I ask you to put together a ChangeLog entry in here that makes it clear that there's a subtle behavioral change in this release? Also: what do you think of posting this on the Haskell Foundation Discourse and giving people a heads-up that this change will be coming before releasing? I'm a bit worried about encouraging a debate on it, but also a bit worried about this one sneaking under people's radars. |
I think getting some feedback would be great. At the very least, maybe we can get some more eyes on this ticket: https://gitlab.haskell.org/ghc/ghc/-/issues/22389 the solution to which would introduce a new primitive that should make this a bit safer. |
Would you be able to publish such a message? I'll be happy to boost with links from Twitter. |
Initial discussion: fpco/safe-exceptions#3
Follow-up on GHC: https://gitlab.haskell.org/ghc/ghc/-/issues/18899
🤔
Fixes #95