Bigslice workers currently store their task outputs locally. These stored outputs are then read by other workers as needed, over direct connections between machines.

When machines are especially flaky, e.g., under high spot-market contention in EC2, progress on a computation can grind to a halt: machines are lost frequently enough that a large portion of time is spent recomputing lost results.

Workers could instead write to a more durable shared backing store. If workers are lost, their results would remain available. This would let computations always make forward progress, at the cost of extra (read: slow) data transfer.
Amazon FSx for Lustre may be a good option, as it's basically designed for this sort of use case:
> The open source Lustre file system is designed for applications that require fast storage – where you want your storage to keep up with your compute. Lustre was built to quickly and cost effectively process the fastest-growing data sets in the world, and it's the most widely used file system for the 500 fastest computers in the world. It provides sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS.
We could also implement something like asynchronous copy to a shared backing store, first preferring worker-worker transfer but falling back to the shared backing store if the machine is no longer available.
It would be good to benchmark various approaches.
There is already a nod to this in the code; there's work to be done to plumb it through.