
capi: Fix batch size control #2308

Merged
merged 8 commits into master from capi-execute-blocks-perpetual-buffer-fix on Sep 23, 2024

Conversation

JacekGlen (Member)

The problem with the current approach is that it catches db::Buffer::MemoryLimitError too late, after the block has already been popped from the block_buffer. In the next iteration we then get the next block, and the assertion SILKWORM_ASSERT(block->header.number == block_number) fails. For the current approach to work, the failed block would have to be re-inserted in the exception handler.

I think a better approach is to control the batch size with the state_buffer.current_batch_state_size() <= max_batch_size check. Using exceptions for control flow is also generally discouraged.

I changed the unit test to cover this scenario. I have also tested the change by synchronizing the full Sepolia chain.
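
For illustration, here is a minimal, self-contained sketch of the proposed loop shape. It reuses the names from this description (block_buffer, state_buffer, current_batch_state_size, max_batch_size), but the types, the per-block sizes, and the execute/commit helpers are simplified stand-ins, not Silkworm's actual API:

```cpp
// Sketch only: drive batch commits from an explicit size check instead of
// catching db::Buffer::MemoryLimitError for control flow. All types and
// numbers here are simplified placeholders.
#include <cstdint>
#include <deque>
#include <iostream>

struct Block {
    uint64_t number{0};
};

struct StateBuffer {
    uint64_t batch_size{0};  // bytes staged in the current batch (simplified)
    uint64_t current_batch_state_size() const { return batch_size; }
    void clear_batch() { batch_size = 0; }
};

// Stand-ins for Silkworm's execution and commit paths.
void execute_block(const Block& block, StateBuffer& state_buffer) {
    state_buffer.batch_size += 1'000;  // pretend each block adds ~1 KB of state
    std::cout << "executed block " << block.number << "\n";
}

void commit_batch(StateBuffer& state_buffer) {
    std::cout << "committed batch of " << state_buffer.current_batch_state_size() << " bytes\n";
    state_buffer.clear_batch();
}

void execute_blocks(std::deque<Block>& block_buffer, StateBuffer& state_buffer,
                    uint64_t max_batch_size) {
    while (!block_buffer.empty()) {
        // Commit before taking the next block once the batch limit is exceeded,
        // so no block is ever popped and then pushed back for a second attempt.
        // Note the trade-off discussed below: a batch can overshoot the limit
        // by up to one block's worth of state.
        if (state_buffer.current_batch_state_size() > max_batch_size) {
            commit_batch(state_buffer);
        }
        Block block = block_buffer.front();
        block_buffer.pop_front();
        execute_block(block, state_buffer);
    }
    commit_batch(state_buffer);
}

int main() {
    std::deque<Block> block_buffer{Block{1}, Block{2}, Block{3}, Block{4}, Block{5}};
    StateBuffer state_buffer;
    execute_blocks(block_buffer, state_buffer, /*max_batch_size=*/2'500);
}
```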

battlmonstr (Contributor) left a comment

This was done the way you suggest before, but was changed in order to fix OOM crashes on low-RAM VMs. The current_batch_state_size check doesn't guarantee that the next iteration's execution won't overflow memory. When execution fails due to a potential overflow, it produces an error. The condition is detected multiple layers deep, so I decided to use an exception rather than result/error codes.

I think it is likely my fault that the logic is broken here. I agree that on MemoryLimitError it should re-insert the block into the block_buffer, because the block was not executed and needs to be executed again (so that block_buffer.pop_back returns it again).

For silkworm_execute_blocks_ephemeral this is not a problem, because prefetched_blocks.pop_front is done only for successful executions.

Alternatively, consider aligning the silkworm_execute_blocks_perpetual logic to do the same and pop only on success.
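
To make the re-insert idea concrete, here is a rough, self-contained sketch. MemoryLimitError, StateBuffer, and the numbers below are simplified stand-ins for db::Buffer::MemoryLimitError and the real execution path, not the actual silkworm_execute_blocks_perpetual code:

```cpp
// Sketch only: re-insert the unexecuted block into block_buffer when the
// memory limit is hit, so that the next pop_back() returns it again after
// the current batch has been committed.
#include <deque>
#include <iostream>
#include <stdexcept>

struct MemoryLimitError : std::runtime_error {  // stand-in for db::Buffer::MemoryLimitError
    MemoryLimitError() : std::runtime_error("memory limit reached") {}
};

struct Block { unsigned long number{0}; };

struct StateBuffer {
    unsigned long batch_size{0};
    void execute(const Block& block) {
        if (batch_size + 1'000 > 2'500) throw MemoryLimitError{};  // would exceed the pretend limit
        batch_size += 1'000;
        std::cout << "executed block " << block.number << "\n";
    }
    void commit() {
        std::cout << "committed batch of " << batch_size << " bytes\n";
        batch_size = 0;
    }
};

void execute_blocks_perpetual(std::deque<Block>& block_buffer, StateBuffer& state_buffer) {
    while (!block_buffer.empty()) {
        Block block = block_buffer.back();
        block_buffer.pop_back();
        try {
            state_buffer.execute(block);
        } catch (const MemoryLimitError&) {
            block_buffer.push_back(block);  // the block was NOT executed: put it back
            state_buffer.commit();          // free the batch, then retry the same block
        }
    }
    state_buffer.commit();
}

int main() {
    std::deque<Block> block_buffer{Block{5}, Block{4}, Block{3}, Block{2}, Block{1}};
    StateBuffer state_buffer;
    execute_blocks_perpetual(block_buffer, state_buffer);
}
```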

JacekGlen (Member, Author) commented Sep 9, 2024

I am split on this, so let's pick @canepat's brain. To summarize:

The current logic is:

  1. Execute blocks until a MemoryLimitError exception is thrown
  2. Re-insert the last block
  3. Commit to the db

I see two problems with this: we use exceptions to control program flow and some blocks must be executed twice. Even if we decide to pop the block only on success, we still face these two issues.

The original approach was:

  1. Execute blocks until the current batch size exceeds MaxBatchSize
  2. Commit to the db

The problem is that we only find out whether a block exceeds the limit after executing it. In some cases this can cause out-of-memory exceptions. There is no easy way to detect whether a block will exceed the limit before executing it.

Some potential fixes:

  • Go deeper into the execution and split it into execute and flush, so we can potentially commit to the db before flushing to the buffer
  • Using the original approach, lower the MaxBatchSize by some arbitrary margin (e.g. 10%, or 1k) so that executing a block never raises an exception (but batch sizes will differ from Erigon's); see the sketch after this list
  • Ignore the issue, since it is Erigon 2 specific
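
As an illustration of the second option, a tiny sketch of the arithmetic with made-up numbers (the assumed MaxBatchSize value and per-block state size are placeholders; the real figures would differ):

```cpp
// Sketch only: keep the original size-based check, but apply it against a
// limit reduced by an arbitrary safety margin (10% here), so that one more
// block's worth of state is unlikely to push the batch past the real limit.
#include <cstdint>
#include <iostream>

int main() {
    const uint64_t max_batch_size = 512ull << 20;                  // assumed 512 MiB limit
    const uint64_t effective_max = max_batch_size - max_batch_size / 10;

    uint64_t current_batch_state_size = 0;                         // stands in for state_buffer.current_batch_state_size()
    const uint64_t per_block_state = 8ull << 20;                   // pretend each block adds ~8 MiB of state

    int blocks_in_batch = 0;
    while (current_batch_state_size <= effective_max) {            // commit before the unreduced limit is reached
        current_batch_state_size += per_block_state;
        ++blocks_in_batch;
    }
    std::cout << "commit after " << blocks_in_batch << " blocks, "
              << current_batch_state_size << " bytes staged\n";
}
```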

battlmonstr (Contributor)

@JacekGlen

Yes, the main reason for not doing it the old way (a priori estimation) is that the estimation function must know how many things are going to be updated/inserted, and this only becomes known AFTER execution.

You are saying that this won't be a problem for Erigon 3. In that case, in my opinion we should not spend too much time on this and should make only minimal changes to fix the logic.

So I'm trying to understand your points better and evaluate them in terms of work scope:

  1. How hard is it, code-wise, to re-insert the last block into the block_buffer?
  2. When you say "some blocks must be executed twice": how big a penalty is that overall?
  3. When you say "split it into execute and flush": what exactly do you mean? I remember that I already did this where I could, but not to 100%, because of some code design problems related to how core/node was split. Even if we split it all the way up to this point, wouldn't we still need some bookkeeping about whether or not to execute when the exception happens?

Regarding the exception usage: we considered using an extended enum to embed this error into the execution result, but in the end decided not to, because it led to extra complexity. Nevertheless, it wouldn't change the need to re-insert the block. If it bothers you to call break inside the catch handler, one suggestion could be to set a boolean flag inside the catch and check it outside. With that said, I understand that the problem is not exactly about exception vs. error, but that the code has become more complicated than it was.
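
A rough sketch of that flag suggestion, assuming a simplified loop (execute_block, commit_batch, and MemoryLimitError are made-up stand-ins, not the real Silkworm code):

```cpp
// Sketch only: set a flag inside the catch handler and break outside of it,
// instead of breaking out of the loop from within the catch block itself.
#include <deque>
#include <iostream>
#include <stdexcept>

struct MemoryLimitError : std::runtime_error {  // stand-in for db::Buffer::MemoryLimitError
    MemoryLimitError() : std::runtime_error("memory limit reached") {}
};

struct Block { unsigned long number{0}; };

// Pretend the third block of every batch would exceed the memory limit.
void execute_block(const Block& block, unsigned long& blocks_in_batch) {
    if (blocks_in_batch >= 2) throw MemoryLimitError{};
    ++blocks_in_batch;
    std::cout << "executed block " << block.number << "\n";
}

void commit_batch(unsigned long& blocks_in_batch) {
    std::cout << "committed " << blocks_in_batch << " blocks\n";
    blocks_in_batch = 0;
}

int main() {
    std::deque<Block> block_buffer{Block{5}, Block{4}, Block{3}, Block{2}, Block{1}};
    unsigned long blocks_in_batch = 0;

    while (!block_buffer.empty()) {
        // Inner loop: fill one batch.
        while (!block_buffer.empty()) {
            Block block = block_buffer.back();
            block_buffer.pop_back();

            bool batch_full = false;
            try {
                execute_block(block, blocks_in_batch);
            } catch (const MemoryLimitError&) {
                block_buffer.push_back(block);  // the block was not executed: put it back
                batch_full = true;              // record the condition inside the catch...
            }
            if (batch_full) {
                break;                          // ...and break outside of it
            }
        }
        commit_batch(blocks_in_batch);  // commit the finished batch, then continue
    }
}
```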

battlmonstr (Contributor) left a comment

🙏

battlmonstr (Contributor)

silkworm_capi_test hangs on CI though

JacekGlen (Member, Author)

@battlmonstr issue fixed, please review again

battlmonstr (Contributor) left a comment

👍

battlmonstr merged commit d65aaa4 into master on Sep 23, 2024 (5 checks passed).
battlmonstr deleted the capi-execute-blocks-perpetual-buffer-fix branch on September 23, 2024 at 08:59.