Cancellation model #60

adamw · 2023-03-28T14:31:47Z

adamw
Mar 28, 2023

I've started exploring the codebase & the design, and I have some doubts on the usefulness of the cancellation model.

Let's suppose I'd like to timeout a blocking I/O operation (like reading from a socket). Using the current code, that would translate to sth like:

Future:
  val f1 = Future { Thread.sleep(1000); throw new TimeoutException() }
  val f2 = Future(blockingIOOperation())
  f1.alt(f2).value

However, that won't work, as the cancellation only impacts reading the value of a Source, not the actual computation: even though the alt will complete after 1s, the blocking operation will continue, still waiting on the socket or such. (And for a good reason: it cannot impact the computation, as Future isn't bound to any thread/fiber, virtual or not.)

Additionally, if we are to follow the structured concurrency approach, we cannot leak any running threads/fibers outside the scope of Future:: so once the overall future completes, any computations that started must somehow complete. The only way to guarantee this here is to wait until both branches finish (defeating the whole purpose of alt: as there's no way to interrupt the sleep, we'll always wait at least 1s).

That's described e.g. on Wikipedia:

The core concept is the encapsulation of concurrent threads of execution (here encompassing kernel and userland threads and processes) by way of control flow constructs that have clear entry and exit points and that ensure all spawned threads have completed before exit.

but also in the introductory articles.

On the other hand, if we let futures leak, there's no structured concurrency, and we are resource-unsafe.

odersky · 2023-03-29T11:56:07Z

odersky
Mar 29, 2023
Maintainer

Timers are currently missing, and should be added to the framework. With timers, I would imagine one of the following scenarios:

Future:
  val f2 = Future(blockingIOOperation())
  await(race(f2, timeout(1000))

We assume that timeout(n) returns a Failure[TimeoutException]; if it doesn't we need to map the timeout async source accordingly.

Or, we could package that into a timeout method on Future itself , then we could write:

Future(blockingIOOperation()).timeout(1000)

So, in short, timeouts need to be handled specially, they are not futures.

About the structured concurrency policy in general: It's debatable whether an enclosing future should always wait until all nested futures have completed, or whether it's enough to just make cancellation requests for all nested futures that are still executing at that point and return immediately. Essentially, we are trading liveness for safety properties here and no policy is best in all cases. So I think we need one policy to be the default with the other policy being an opt-in. Which is which I am not sure yet. Following structured concurrency to the letter means that rogue computations can deadlock the system. Returning with just cancellation requests means that resources might still be blocked when a future returns. Probably, waiting for completion is best as the default. What's your opinion?

0 replies

He-Pin · 2023-03-29T17:50:15Z

He-Pin
Mar 29, 2023

For timer, what kind of timer is going to be added？A scheduledExecutorService liked one for every core，or a HashedWheelTimer based one？

And in jdk 20, there is a StructuredTaskScope, and support cancellation propagation which will help for not leaking. Cpu time is kind of resource too.

StructuredTaskScope.ShutdownOnSuccess/ShutdownOnFailure。

0 replies

odersky · 2023-03-29T19:01:36Z

odersky
Mar 29, 2023
Maintainer

@He-Pin I think we are open to suggestions. I know nothing about timers, just that they exist.

0 replies

djspiewak · 2023-03-29T21:55:02Z

djspiewak
Mar 29, 2023

So, in short, timeouts need to be handled specially, they are not futures.

I'm not sure I buy this. Timeouts are one of the primary use cases for cancelation in practice (they're absolutely the most frequent use of cancelation in production applications since every socket has at least one associated timer, and often more than one), and they have a tendency to show up in a lot of composed variants. Pure races in the sense that you may be envisioning with alt are not common in practice, except when acting as an encoding mechanism for… other encodings of timeouts (e.g. racing a Deferred#get with a more useful effect and using the Deferred as an external kill switch).

Fwiw, I quite like @adamw's example as a way of testing the limits of the framework as it stands, though it's worth noting that Thread.sleep is kind of unfair as an example since the only way to cancel this type of calculation is Thread#interrupt(). Shifting to something like a scheduled executor on the backend would give you async timers (though you'll probably have to reinvent Cats Effect's Scheduler since there is no analogue for ExecutionContext in the stdlib), but just shifting to an async execution model won't be enough to make cancelation fully functional.

Cancelation in these types of systems (as end-users often conceptualize it) is really two things: sequence preemption and async interruption. The former is implemented by Cats Effect in the form of cancel checks on IO stages (most notably, flatMap), while the latter is implemented by Cats Effect in the form of a three-way race condition in the IO.async internals, and by Thread in the form of InterruptedException on some (but sadly not all) hard-blocking operations. If I understand this repository correctly, the former is implemented (by checking cancelation on .await, which corresponds to Cats' flatMap), but the latter is not implemented at all. This is the real meat of what Adam is pointing out here.

About the structured concurrency policy in general: It's debatable whether an enclosing future should always wait until all nested futures have completed, or whether it's enough to just make cancellation requests for all nested futures that are still executing at that point and return immediately. Essentially, we are trading liveness for safety properties here and no policy is best in all cases. So I think we need one policy to be the default with the other policy being an opt-in. Which is which I am not sure yet. Following structured concurrency to the letter means that rogue computations can deadlock the system. Returning with just cancellation requests means that resources might still be blocked when a future returns. Probably, waiting for completion is best as the default. What's your opinion?

You're quite correct that there's no right answer here. You're trying to decide between resource safety and deadlock safety, and it's not possible in general to have both. Having used various systems which bias in each direction for about a decade, I can tell you that I'm 100% in the camp of biasing in favor of resource safety. Deadlocks are usually pretty easy to track down with the tools provided by modern effect systems and tend to happen under normal operating conditions (not always, but usually). Conversely, the consequences of resource unsafety are leaks, crashes, and failure to self-heal under pressure. These issues manifest in production under the worst possible situations: when the system is under significant pressure due to some other externality. In other words, biasing away from resource safety causes the system to exhibit the worst-case behavior at the worst possible moment, whereas biasing away from deadlock safety causes the system to exhibit the worst-case behavior at (generally) more opportune times.

With that said, strictly following structured concurrency is generally too restrictive in practice, and it makes it impossible to experiment and innovate around resource scoping and lifecycles (Fs2 is an excellent example of this: the entire framework is structured, but its implementation is in terms of unstructured primitives; Cats Effect's Supervisor is a simpler and more self-contained example). I generally agree that structured concurrency should be the user-facing default in all cases, but denying access to unstructured primitives is too restrictive a stance from a framework which is as low-level as this one. My recommendation would be to retain the backpressure semantic for combinators (including alt), but think carefully about an unstructured forking operation (similar to Cats Effect's start).

0 replies

He-Pin · 2023-03-30T03:25:23Z

He-Pin
Mar 30, 2023

@alexandru @viktorklang @jdegoes ping~ would be great to have your thoughts too

0 replies

odersky · 2023-03-30T08:02:05Z

odersky
Mar 30, 2023
Maintainer

@djspiewak Thanks for your comments. This repo is still a strawman which means, made to be knocked down 😉. So it's really valuable to get constructive criticism at this early state.

I did not fully understand the point about async interruption:

while the latter is implemented by Cats Effect in the form of a three-way race condition in the IO.async internals, and by Thread in the form of InterruptedException on some (but sadly not all) hard-blocking operations. If I understand this repository correctly, the former is implemented (by checking cancelation on .await, which corresponds to Cats' flatMap), but the latter is not implemented at all. This is the real meat of what Adam is pointing out here.

Did you have in mind that when cancelling a future, we also should do a Thread.interrupt() to get the future out of states where it blocks in some JDK defined way? I agree, this would be necessary. So far, integration with the Java thread model is very rudimentary. What other things need to be done?

I think I also agree with you that structured concurrency should be the default. In fact, if we implement that strategy, the opt-out is already available, since we can unlink a cancellable, which means it will not get cancelled when its scope is completed.

0 replies

odersky · 2023-03-30T08:43:03Z

odersky
Mar 30, 2023
Maintainer

See db25ad4 in #13 for changes that implement the structured concurrency rules for cancellation. (As always so far, totally ignoring optimizations).

0 replies

djspiewak · 2023-03-30T17:09:47Z

djspiewak
Mar 30, 2023

Did you have in mind that when cancelling a future, we also should do a Thread.interrupt() to get the future out of states where it blocks in some JDK defined way? I agree, this would be necessary. So far, integration with the Java thread model is very rudimentary. What other things need to be done?

There are a couple angles to this.

First, for a Future which is defined by a blocking or compute-bound body (RunnableFuture), you could consider running a Thread#interrupt() on cancelation, but my strong recommendation would be to only do this if the user has indicated this is in fact the semantic they want. Interruption is regularly handled completely incorrectly by third-party code, and in many cases handled in a fashion which leaks resources or otherwise leaves systems in invalid states. Cats Effect addresses this by having two separate constructors, blocking and interruptible, with the former pool-shunting but not supporting cancelation, while the latter pool-shunts and interrupts on cancel. (note these are both distinct from delay, which does not pool shunt and is used for things that are compute-bound but not blocking).

Second, for any Future composed of other Futures, regardless of how those constituents were constructed or scheduled, you need to be able to interrupt the sequencing (i.e. the fiber/coroutine/virtual thread/thingy). This is the only form of cancelation implemented today in the strawman, since await checks the cancelation state.

Third, for futures which are constructed via Promise, you need a way to cancel the asynchronous suspension in a fashion which immediately signals back to the registrant that resources should be cleaned up. This is what @adamw's example is going to, and you will need this functionality in order to implement timers. Reading the source code, the comments suggest that asynchronous cancelation will eventually (it is not implemented at present) be handled by throwing a CancellationException from Promise#complete, but this isn't really sufficient. The problem is that this will only signal cancelation when the asynchronous action completes, meaning that the Promise and any resources associated with the async action itself (e.g. timer state) will leak until they complete naturally. Conceptually, what you're trying to do is you're trying to create a signaling mechanism for each Promise which the end-user will use to call a removeListener function (or similar).

This third mechanism btw tends to be very subtle and complex, because you have to juggle three possible legs to a race condition:

The fiber is externally canceled
The Promise is completed
The fiber which created the Promise hits the suspension point (i.e. where it is descheduled from the thread and begins asynchronously awaiting the promise completion)

All three of these things can interleave in any order on separate threads (and the second one tends to happen on third-party threads, so you really have no control over it).

Oh as an aside, the listener notification in complete needs to be thread-shifted, otherwise you end up leaking third-party threads (e.g. NIO workers) back into application code and it gets really hard for end-users to control execution. But this is orthogonal to cancelation.

0 replies

odersky · 2023-03-30T19:23:32Z

odersky
Mar 30, 2023
Maintainer

Thanks for this crash course in low-level scheduling. Very much appreciated! So if I understood correctly.

blocking for externally blocking tasks: execute future on a separate thread.
interruptible for regular tasks: execute future in the thread pool, interrupt on cancel and wait for completion.
delay for tasks that survive their scope: execute future but don't cancel when its group gets cancelled.

We currently have interruptible and delay (via unlink), but not blocking. blocking looks like an essential addition. Without it we'd have to fall back on plain threads for externally blocking computations, which would be ugly.

For promises, I was thinking of just setting a flag in the future value of a promise (or in the promise itself) that cancellation was requested and let the thread(s) that fulfil the promise poll that. I don't see what else one could do; certainly raising an exception in complete will not suffice (or even be helpful). Hopefully systems will consist mostly of RunnableFutures, which are functional, rather than Promise#futures, which are not.

Oh as an aside, the listener notification in complete needs to be thread-shifted, otherwise you end up leaking third-party threads (e.g. NIO workers) back into application code and it gets really hard for end-users to control execution. But this is orthogonal to cancelation.

I am not sure. Note that it's only the notification logic itself which is executed on the thread of the completing future. The continuation of the notified future is in any case scheduled on a different thread. The notification logic should usually run in a very short time span.

0 replies

djspiewak · 2023-03-30T23:38:30Z

djspiewak
Mar 30, 2023

Thanks for this crash course in low-level scheduling. Very much appreciated!

Happy to nerd out on it any time! There's a lot more here to explore and it's a lot of fun to wade through.

blocking for externally blocking tasks: execute future on a separate thread.

Yep! And to clarify, these tasks would not be cancelable. Or rather, cancelation would (asynchronously) wait until the task finished (to preserve structural concurrency).

interruptible for regular tasks: execute future in the thread pool, interrupt on cancel and wait for completion.

Almost. Thread#interrupt really only makes sense for blocking tasks, because of how it tends to be handled by most third-party code (in particular, very few people actually check Thread.interrupted, and most who do, do it incorrectly). So you want interruptible for tasks that are blocking and you want to try to make them cancelable by implementing the interrupt protocol.

delay for tasks that survive their scope: execute future but don't cancel when its group gets cancelled.

Nope. delay for regular tasks which otherwise have normal scoping rules but which don't block the thread. In other words, there's a distinction here between compute-bound tasks and I/O-bound tasks which block: both eat up a thread, but the former needs CPU time to make progress while the latter only needs external time (clock or I/O). delay vs blocking allows the user to express this distinction at construction time.

blocking looks like an essential addition. Without it we'd have to fall back on plain threads for externally blocking computations, which would be ugly.

I think so too.

For promises, I was thinking of just setting a flag in the future value of a promise (or in the promise itself) that cancellation was requested and let the thread(s) that fulfil the promise poll that. I don't see what else one could do; certainly raising an exception in complete will not suffice (or even be helpful). Hopefully systems will consist mostly of RunnableFutures, which are functional, rather than Promise#futures, which are not.

There's a bunch of things to unpack in here. Starting from the last bit…

You actually want most of the system to consist of Promise#futures (and futures built from Promise#futures via async/await) rather than RunnableFutures. The reason for this comes back to the reason for asynchrony in the first place: ultimately it all comes down to multiplexing I/O interrupts. In a hypothetical system where you never needed asynchrony (and thus, all futures can be RunnableFuture), you would also never be performing I/O, and so you either wouldn't really need parallelism or your parallelism would be entirely compute-bound. At that point, you're much better off with a data-parallel framework like Spark, good old parallel collections, or even something low level like OpenCL or even SIMD. It really would be an entirely different problem space.

The parallelism which is enabled by this sort of concurrency is I/O parallelism, where the motivating use-case is scatter/gather workflows (e.g. handling many thousands of connections, all coming from a single server socket, each of which in turn results in many more thousands of upstream connections, all of which should be parallelized). Each of those I/O events will either be blocking (in which case, it eats a thread and can't be done in parallel without starving the system of resources), or non-blocking (in which case, it is defined in terms of Promise#future). This is what I mean by "you really want Promise#future to be most of the system".

Regarding the async cancelation mechanism, it's definitely possible to do better, but probably not with the Promise API. Cats Effect resolves this at a low level by simply having an onCancel primitive which works even when applied to the equivalent of await on a Promise#future. This is generally wrapped though by a more user-friendly primitive which does away with Promise-like things altogether in favor of a lexically scoped constructor, which takes the form of IO.async. The type signature is a bit squint-inducing, but it does solve the issue: IO.async[A]((Either[Throwable, A] => Unit) => IO[Option[IO[Unit]]]): IO[A] The Option[IO[Unit]] represents the cancelation action for the async registration, which can then proactively and immediately clean up resources. For example, here is essentially how sleep is typically implemented:

val executor: ScheduledExecutorService = ???   // make one of these per process

def sleep(d: FiniteDuration): IO[Unit] =
  IO.async[Unit] { cb =>
    IO {
      val fut = executor.schedule(() => cb(Right(())), d.length, d.unit)
      Some(IO(fut.cancel()))
    }
  }

The fact that we can call fut.cancel() immediately as an inseparable part of the cancelation is the important bit. This is what ties the backpressure together, avoids resource leaks, and ensures the system remains safe. Under the surface, this fut.cancel() is implemented by ScheduledExecutorService to remove the timer callback (() => cb(Right(()))) from its internal data structure, eliminating memory leaks and making the other timer ticks more efficient. If we wait until the timer fires, we could be leaking for a very long time before cleanup. If we ask the executor to poll something, we burn an extra thread, and there's no particularly natural place to put that machinery.

The continuation of the notified future is in any case scheduled on a different thread.

Is that the case? I didn't see that in the code but I also didn't read everything front to back. It looks to me like any suspended awaits are fired like listeners and sequentially executed, without any proactive yielding back to the underlying thread pool. That would in turn mean that continuations would tend to stick to whatever thread calls complete up to the next asynchronous boundary.

0 replies

adamw · 2023-03-31T13:02:36Z

adamw
Mar 31, 2023
Author

Timers are currently missing, and should be added to the framework. With timers, I would imagine one of the following scenarios:

Timers are one example, but you could also have an example without timers:

Future:
  val f1 = Future(blockingIOOperation1())
  val f2 = Future(blockingIOOperation2())
  f1.alt(f2).value

e.g. racing cache retrieval with a DB query. When the faster one succeeds, you want the other one interrupted. And I'd say that it should be interrupted in a way that actually stops and releases the network socket.

As a side question: when cancelling a future, should the future.cancel or future.await block until the future is actually complete? I think that's the case e.g. in ZIO, but not in cats.

About the structured concurrency policy in general: It's debatable whether an enclosing future should always wait until all nested futures have completed, or whether it's enough to just make cancellation requests for all nested futures that are still executing at that point and return immediately.

The way I understood structured concurrency, is that we should always wait for the nested futures/threads to complete. But of course I'm authoritative here :). Though without that property, we don't have backpressure. And that's a property that is often standard (e.g. the whole reactive "movement"). I definitely agree with Daniel, that deadlocks are easier to reproduce/debug than running out of resources.

You actually want most of the system to consist of Promise#futures (and futures built from Promise#futures via async/await) rather than RunnableFutures.

I'm not sure I understand why that's the case. But maybe I misunderstand what's the target platform. From what I've imagined, at least as far as the JVM is concerned, a direct-style concurrency API would target a Loom-based runtime. There, it's "normal" to block (virtual) threads, and it's "normal" to do cancellation via interruption. In fact, that's the only possible way.

I don't think the goal of a project like async would be to rewrite the cats async runtime (as we could just use cats then, to a better effect). So in this setting, using promise-futures would in fact be inferior to runnable futures: the calculation computing the value of the promise-future can't be cancelled (as we don't know where this computation happens), while with a runnable future, we should interrupt the thread. And since we have cheap virtual threads, creating runnable futures is not a problem.

This of course skips over the issues with the Java interruption model, fairness of virtual thread scheduling and I think I also read that Daniel mentioned a hidden unbounded thread pool when using io_uring. But to fix these I think the solution is to use an effect system with a custom runtime (such as cats or ZIO).

0 replies

SystemFw · 2023-03-31T15:57:51Z

SystemFw
Mar 31, 2023

when cancelling a future, should the future.cancel or future.await block until the future is actually complete? I think that's the case e.g. in ZIO, but not in cats.

In cats-effect (and ZIO) when you cancel an IO, cancel waits until the IO is actually cancelled and all its finalisers have run. In short, we say that cancel backpressures on finalisers. This is an important property because otherwise you cannot nest bracket-like constructs (try/finally, Resource, Using etc): without backpressure, on cancelation the finalisers would be fire-and-forget, which means you cannot guarantee release in reverse order of acquisition, and risk close-during-use crashes. Therefore I think that:

or whether it's enough to just make cancellation requests for all nested futures that are still executing at that point and return immediately.

is not enough.
Note however that this doesn't mean that after canceling you have to manually join, the backpressuring is embedded in the semantics of cancel itself.

0 replies

odersky · 2023-03-31T16:16:10Z

odersky
Mar 31, 2023
Maintainer

@djspiewak

Nope. delay for regular tasks which otherwise have normal scoping rules but which don't block the thread. In other words, there's a distinction here between compute-bound tasks and I/O-bound tasks which block: both eat up a thread, but the former needs CPU time to make progress while the latter only needs external time (clock or I/O). delay vs blocking allows the user to express this distinction at construction time.

So delay tasks also execute on their own newly created thread, but won't get an interrupt on cancellation, like blocking tasks do? Did I get that right?

I find your take on promised vs runnable futures interesting. I'm learning that promised futures cannot be neglected. But there are also scenarios where runnable futures are dominant. One is compute bound things. Runnable futures are the natural building block for dataflow parallelism. I would imagine that higher level algorithms such as parallel collections could be built on them. Also, anything we do with an externally fulfilled future takes place in a runnable future. So promised futures should only be at the extreme points of a system where it interacts with the external world. But yes, cancelling them is also important. I'll look deeper into the solution you propose.

@adamw

e.g. racing cache retrieval with a DB query. When the faster one succeeds, you want the other one interrupted. And I'd say that it should be interrupted in a way that actually stops and releases the network socket.

Good points. We need to look into interrupts in more detail.

As a side question: when cancelling a future, should the future.cancel or future.await block until the future is actually complete? I think that's the case e.g. in ZIO, but not in cats.

Right now (i.e. with #13), future.cancel() does not block, but the enclosing completion scope will only complete when the future has completed. On complete, a future unlinks itself from its completion group. The completion group will only complete once there are no more uncompleted members. This means we get structured concurrency but can at the same time cancel a bunch of futures in parallel.

Maybe there should also be an ensureCancelled method that does blocking cancel directly.

0 replies

jackcviers · 2023-03-31T18:30:42Z

jackcviers
Mar 31, 2023

So promised futures should only be at the extreme points of a system where it interacts with the external world.

Of note, in Kotlin Coroutines, Job and launch seem to serve very similar purposes as what has been done with Blocking and CompletionGroup in #13.

0 replies

jdegoes · 2023-05-02T06:36:13Z

jdegoes
May 2, 2023

@He-Pin

Since you pulled me into this thread, I'll share my thoughts:

I think that designing a primitive for concurrent, async + blocking computation, which supports structured concurrency, runs efficiently, is properly backpressured, supports timeouts and races with sane semantics, and so forth, is a very complex and error-prone undertaking, ultimately benefiting from detailed knowledge of and experience in low-level systems programming, as well as concurrent programming, async programming, and I/O scheduling (to name a few!).

Looking at the design of the async library in its current and earlier forms, and wading through this thread, I believe that that async makes a number of design choices that are known to be suboptimal with respect to several measurable dimensions (including minimization of fiber/thread leaks, compatibility with high-performance schedulers, etc.).

Since in the Scala community we are fortunate enough to have collective experience building this machinery, and iterating on it over a period of years, in response to user feedback, it would be a shame to not take advantage of all this domain expertise.

I highly recommend that development of async proceed in a way that is carefully geared at leveraging this expertise: detailed consultation with authors of existing effect systems, including @djspiewak, @alexandru, @adamgfraser, etc. As this thread demonstrates, I am sure you will find these authors to be happy to help fill in gaps and explain issues with the current design.

These wheels have been invented many times before, and this reinvention, if it must happen within EPFL (why?), should learn from the other inventions to enjoy the best possible outcome.

0 replies

odersky · 2023-05-02T09:35:51Z

odersky
May 2, 2023
Maintainer

This is a very early prototype, meant as a feasibility study and to explore the design space of possible APIs. We very much invite feedback and suggestions for all areas, in particular concerning scheduling and cancellation. The Scala community has a lot of expertise to offer, on which we want to draw.

I believe there is also the important space of low-level concurrency mechanisms where it would make sense to share code between different effect systems.

I intentionally talked early about this project at Scalar in order to draw the community's attention to it, and profit from their input.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cancellation model #60

{{title}}

Replies: 16 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Cancellation model #60

Replies: 16 comments

odersky Mar 29, 2023 Maintainer

odersky Mar 29, 2023 Maintainer

odersky Mar 30, 2023 Maintainer

odersky Mar 30, 2023 Maintainer

odersky Mar 30, 2023 Maintainer

adamw Mar 31, 2023 Author

odersky Mar 31, 2023 Maintainer

odersky May 2, 2023 Maintainer

odersky
Mar 29, 2023
Maintainer

odersky
Mar 29, 2023
Maintainer

odersky
Mar 30, 2023
Maintainer

odersky
Mar 30, 2023
Maintainer

odersky
Mar 30, 2023
Maintainer

adamw
Mar 31, 2023
Author

odersky
Mar 31, 2023
Maintainer

odersky
May 2, 2023
Maintainer