Under heavy PUT load, the penciller may enter a backlog state. In this state, updates from the bookie to the penciller are "returned", but updates to the bookie can continue and are backlogged within the bookie's ETS table (a.k.a. the ledger cache).
Once a backlog occurs and PUTs continue, the slow_offer state will eventually become perpetual. However, handling the slow_offer state is the responsibility of the calling application (e.g. the riak_kv_leveled_backend logic in Riak, which will pause for 10ms).
However, if the application's response is deficient (as can be the case in Riak - a 10ms pause does not appear to be enough), a vicious circle can develop. For example, if there is a long-lived penciller backlog of, say, 5 minutes, then even with a 10ms pause per PUT, 30K PUTs can be received in this state. At 20 index changes per PUT, that means 600K key changes are now backed up in the bookie's ledger cache.
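The arithmetic behind these figures can be checked directly; all numbers below are taken from the scenario described above (a 5-minute backlog, a 10ms pause per PUT, 20 index changes per PUT):

```python
# Back-of-envelope check of the backlog arithmetic above.
backlog_s = 5 * 60          # long-lived penciller backlog, in seconds
pause_s = 0.010             # the 10ms pause applied per PUT by the backend
index_changes_per_put = 20  # index entries generated by each PUT

# Even paced at one PUT per pause interval, this many PUTs arrive
# while the penciller is backlogged:
puts_received = int(backlog_s / pause_s)

# Each PUT carries its index changes into the ledger cache:
key_changes = puts_received * index_changes_per_put

print(puts_received, key_changes)  # 30000 600000
```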
Once the penciller clears its backlog, it is immediately in receipt of this accumulated cache - and is immediately back in a backlogged state. Further, writing a 600K-key L0 file will be time-consuming, and merging it into L1 may take 20x longer than a normal merge. The problem only gets worse.
If the backend_pause is well-tuned, this can be avoided. But in this case, would it have been preferable to stall the backend entirely to prevent the vicious circle? Once the ledger cache grows bigger than the supported penciller cache size, should the vnode freeze until the ledger backlog is handled?
It might be best to still leave this to the application, but offer an additional signal beyond ok|pause - instead supporting ok|pause|stall. Here, stall indicates that the back-pressure is extreme, and that a significant pause is required to prevent a vicious circle of degrading performance.
With Riak, the default backend pause should probably be increased, and perhaps 10x the backend pause could then be used on stall.
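A minimal sketch of how a calling application might interpret the proposed three-way signal, written in Python for illustration. The signal names mirror the proposal above; `BACKEND_PAUSE` and the 10x stall multiplier are the illustrative values suggested here, not an actual leveled or Riak configuration:

```python
import time

BACKEND_PAUSE = 0.010  # base pause in seconds (Riak's current 10ms default)
STALL_MULTIPLIER = 10  # stall = 10x the backend pause, per the suggestion above

def apply_backpressure(signal):
    """Sleep according to the back-pressure signal returned by the store,
    and return the duration slept (in seconds)."""
    if signal == "ok":
        # No back-pressure: proceed immediately.
        return 0.0
    if signal == "pause":
        # Moderate back-pressure: the existing per-PUT pause.
        time.sleep(BACKEND_PAUSE)
        return BACKEND_PAUSE
    if signal == "stall":
        # Extreme back-pressure: a much longer pause to let the
        # penciller drain before the ledger cache grows further.
        time.sleep(BACKEND_PAUSE * STALL_MULTIPLIER)
        return BACKEND_PAUSE * STALL_MULTIPLIER
    raise ValueError("unknown back-pressure signal: %r" % (signal,))
```

The design point is that the application keeps control of scheduling (as today), while the store gains a way to distinguish ordinary back-pressure from the extreme case that would otherwise feed the vicious circle.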