Bigslice workers currently store their task outputs locally. These stored outputs are then read by other workers as needed, over direct connections between machines.

When machines are especially flaky, e.g., under high spot-market contention in EC2, progress on a computation can grind to a halt: machines are lost frequently enough that a large portion of time is spent recomputing lost results.

Workers could instead write to a more durable shared backing store. If workers are lost, their results would remain available. This would let computations always make forward progress, at the cost of extra (read: slow) data transfer.
Amazon FSx for Lustre may be a good option, as it's basically designed for this sort of use case:
> The open source Lustre file system is designed for applications that require fast storage – where you want your storage to keep up with your compute. Lustre was built to quickly and cost effectively process the fastest-growing data sets in the world, and it's the most widely used file system for the 500 fastest computers in the world. It provides sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS.
We could also implement something like asynchronous copy to a shared backing store, first preferring worker-worker transfer but falling back to the shared backing store if the machine is no longer available.
It would be good to benchmark various approaches.
There is already a nod to this in the code; there's work to be done to plumb it through.