
capi: Fix batch size control #2308

Merged
merged 8 commits into master from capi-execute-blocks-perpetual-buffer-fix on Sep 23, 2024

Conversation

JacekGlen (Member)

The problem with the current approach is that it catches db::Buffer::MemoryLimitError too late, after the block has already been popped from the block_buffer. In the next iteration we then get the next block, and the assertion SILKWORM_ASSERT(block->header.number == block_number) fails. For the current approach to work, the failed block would have to be re-inserted in the exception handler.

I think a better approach is to control the batch size with the state_buffer.current_batch_state_size() <= max_batch_size check. Using exceptions for control flow is also generally discouraged.

I changed the unit test to cover this scenario. I have also tested the change by synchronizing the full Sepolia chain.
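
For illustration, here is a minimal, self-contained sketch of the proposed loop shape. It reuses the names from this description (block_buffer, state_buffer, current_batch_state_size, max_batch_size), but the types, the per-block sizes, and the execute/commit helpers are simplified stand-ins, not Silkworm's actual API:

```cpp
// Sketch only: drive batch commits from an explicit size check instead of
// catching db::Buffer::MemoryLimitError for control flow. All types and
// numbers here are simplified placeholders.
#include <cstdint>
#include <deque>
#include <iostream>

struct Block {
    uint64_t number{0};
};

struct StateBuffer {
    uint64_t batch_size{0};  // bytes staged in the current batch (simplified)
    uint64_t current_batch_state_size() const { return batch_size; }
    void clear_batch() { batch_size = 0; }
};

// Stand-ins for Silkworm's execution and commit paths.
void execute_block(const Block& block, StateBuffer& state_buffer) {
    state_buffer.batch_size += 1'000;  // pretend each block adds ~1 KB of state
    std::cout << "executed block " << block.number << "\n";
}

void commit_batch(StateBuffer& state_buffer) {
    std::cout << "committed batch of " << state_buffer.current_batch_state_size() << " bytes\n";
    state_buffer.clear_batch();
}

void execute_blocks(std::deque<Block>& block_buffer, StateBuffer& state_buffer,
                    uint64_t max_batch_size) {
    while (!block_buffer.empty()) {
        // Commit before taking the next block once the batch limit is exceeded,
        // so no block is ever popped and then pushed back for a second attempt.
        // Note the trade-off discussed below: a batch can overshoot the limit
        // by up to one block's worth of state.
        if (state_buffer.current_batch_state_size() > max_batch_size) {
            commit_batch(state_buffer);
        }
        Block block = block_buffer.front();
        block_buffer.pop_front();
        execute_block(block, state_buffer);
    }
    commit_batch(state_buffer);
}

int main() {
    std::deque<Block> block_buffer{Block{1}, Block{2}, Block{3}, Block{4}, Block{5}};
    StateBuffer state_buffer;
    execute_blocks(block_buffer, state_buffer, /*max_batch_size=*/2'500);
}
```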

battlmonstr (Contributor) left a comment

This was done the way you suggest before, but was changed in order to fix OOM crashes on low-RAM VMs. The current_batch_state_size check doesn't guarantee that the next iteration's execution won't overflow memory. When execution fails due to a potential overflow, it produces an error. The condition is detected multiple layers deep, so I decided to use an exception rather than result/error codes.

I think it is likely my fault that the logic is broken here. I agree that on MemoryLimitError it should re-insert the block into the block_buffer, because the block was not executed and needs to be executed again (so that block_buffer.pop_back returns it again).

For silkworm_execute_blocks_ephemeral this is not a problem, because prefetched_blocks.pop_front is done only for successful executions.

Alternatively, consider aligning the silkworm_execute_blocks_perpetual logic to do the same and pop only on success.
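
To make the re-insert idea concrete, here is a rough, self-contained sketch. MemoryLimitError, StateBuffer, and the numbers below are simplified stand-ins for db::Buffer::MemoryLimitError and the real execution path, not the actual silkworm_execute_blocks_perpetual code:

```cpp
// Sketch only: re-insert the unexecuted block into block_buffer when the
// memory limit is hit, so that the next pop_back() returns it again after
// the current batch has been committed.
#include <deque>
#include <iostream>
#include <stdexcept>

struct MemoryLimitError : std::runtime_error {  // stand-in for db::Buffer::MemoryLimitError
    MemoryLimitError() : std::runtime_error("memory limit reached") {}
};

struct Block { unsigned long number{0}; };

struct StateBuffer {
    unsigned long batch_size{0};
    void execute(const Block& block) {
        if (batch_size + 1'000 > 2'500) throw MemoryLimitError{};  // would exceed the pretend limit
        batch_size += 1'000;
        std::cout << "executed block " << block.number << "\n";
    }
    void commit() {
        std::cout << "committed batch of " << batch_size << " bytes\n";
        batch_size = 0;
    }
};

void execute_blocks_perpetual(std::deque<Block>& block_buffer, StateBuffer& state_buffer) {
    while (!block_buffer.empty()) {
        Block block = block_buffer.back();
        block_buffer.pop_back();
        try {
            state_buffer.execute(block);
        } catch (const MemoryLimitError&) {
            block_buffer.push_back(block);  // the block was NOT executed: put it back
            state_buffer.commit();          // free the batch, then retry the same block
        }
    }
    state_buffer.commit();
}

int main() {
    std::deque<Block> block_buffer{Block{5}, Block{4}, Block{3}, Block{2}, Block{1}};
    StateBuffer state_buffer;
    execute_blocks_perpetual(block_buffer, state_buffer);
}
```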

JacekGlen (Member, Author) commented Sep 9, 2024

I am split on this, so let's pick @canepat's brain. To summarize:

The current logic is:

  1. Execute blocks until a MemoryLimitError exception is thrown
  2. Re-insert the last block
  3. Commit to the db

I see two problems with this: we use exceptions to control program flow and some blocks must be executed twice. Even if we decide to pop the block only on success, we still face these two issues.

The original approach was:

  1. Execute blocks until the current batch size exceeds MaxBatchSize
  2. Commit to the db

The problem is that we only find out whether a block exceeds the limit after executing it. In some cases this can cause out-of-memory exceptions. There is no easy way to detect whether a block will exceed the limit before executing it.

Some potential fixes:

  • Go deeper into the execution and split it into execute and flush, so we can potentially commit to the db before flushing to the buffer
  • Using the original approach, lower the MaxBatchSize by some arbitrary margin (e.g. 10%, or 1k) so that executing a block never raises an exception (but batch sizes will differ from Erigon's); see the sketch after this list
  • Ignore the issue, since it is Erigon 2 specific
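
As an illustration of the second option, a tiny sketch of the arithmetic with made-up numbers (the assumed MaxBatchSize value and per-block state size are placeholders; the real figures would differ):

```cpp
// Sketch only: keep the original size-based check, but apply it against a
// limit reduced by an arbitrary safety margin (10% here), so that one more
// block's worth of state is unlikely to push the batch past the real limit.
#include <cstdint>
#include <iostream>

int main() {
    const uint64_t max_batch_size = 512ull << 20;                  // assumed 512 MiB limit
    const uint64_t effective_max = max_batch_size - max_batch_size / 10;

    uint64_t current_batch_state_size = 0;                         // stands in for state_buffer.current_batch_state_size()
    const uint64_t per_block_state = 8ull << 20;                   // pretend each block adds ~8 MiB of state

    int blocks_in_batch = 0;
    while (current_batch_state_size <= effective_max) {            // commit before the unreduced limit is reached
        current_batch_state_size += per_block_state;
        ++blocks_in_batch;
    }
    std::cout << "commit after " << blocks_in_batch << " blocks, "
              << current_batch_state_size << " bytes staged\n";
}
```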

battlmonstr (Contributor)

@JacekGlen

Yes, the main reason for not doing it the old way (a priori estimation) is that the estimation function must know how many things are going to be updated/inserted, and this only becomes known AFTER execution.

You are saying that this won't be a problem for Erigon 3. In that case, in my opinion we should not spend too much time on this and should make only minimal changes to fix the logic.

So I'm trying to understand your points better and evaluate them in terms of work scope:

  1. How hard is it, code-wise, to re-insert the last block into the block_buffer?
  2. When you say "some blocks must be executed twice": how big a penalty is that overall?
  3. When you say "split it into execute and flush": what exactly do you mean? I remember that I already did this where I could, but not to 100%, because of some code design problems related to how core/node was split. Even if we split it all the way up to this point, wouldn't we still need some bookkeeping about whether or not to execute when the exception happens?

Regarding the exception usage: we considered using an extended enum to embed this error into the execution result, but in the end decided not to, because it led to extra complexity. Nevertheless, it wouldn't change the need to re-insert the block. If it bothers you to call break inside the catch handler, one suggestion could be to set a boolean flag inside the catch and check it outside. With that said, I understand that the problem is not exactly about exception vs. error, but that the code has become more complicated than it was.
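
A rough sketch of that flag suggestion, assuming a simplified loop (execute_block, commit_batch, and MemoryLimitError are made-up stand-ins, not the real Silkworm code):

```cpp
// Sketch only: set a flag inside the catch handler and break outside of it,
// instead of breaking out of the loop from within the catch block itself.
#include <deque>
#include <iostream>
#include <stdexcept>

struct MemoryLimitError : std::runtime_error {  // stand-in for db::Buffer::MemoryLimitError
    MemoryLimitError() : std::runtime_error("memory limit reached") {}
};

struct Block { unsigned long number{0}; };

// Pretend the third block of every batch would exceed the memory limit.
void execute_block(const Block& block, unsigned long& blocks_in_batch) {
    if (blocks_in_batch >= 2) throw MemoryLimitError{};
    ++blocks_in_batch;
    std::cout << "executed block " << block.number << "\n";
}

void commit_batch(unsigned long& blocks_in_batch) {
    std::cout << "committed " << blocks_in_batch << " blocks\n";
    blocks_in_batch = 0;
}

int main() {
    std::deque<Block> block_buffer{Block{5}, Block{4}, Block{3}, Block{2}, Block{1}};
    unsigned long blocks_in_batch = 0;

    while (!block_buffer.empty()) {
        // Inner loop: fill one batch.
        while (!block_buffer.empty()) {
            Block block = block_buffer.back();
            block_buffer.pop_back();

            bool batch_full = false;
            try {
                execute_block(block, blocks_in_batch);
            } catch (const MemoryLimitError&) {
                block_buffer.push_back(block);  // the block was not executed: put it back
                batch_full = true;              // record the condition inside the catch...
            }
            if (batch_full) {
                break;                          // ...and break outside of it
            }
        }
        commit_batch(blocks_in_batch);  // commit the finished batch, then continue
    }
}
```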

battlmonstr (Contributor) left a comment

🙏

battlmonstr (Contributor)

silkworm_capi_test hangs on CI though

JacekGlen (Member, Author)

@battlmonstr issue fixed, please review again

battlmonstr (Contributor) left a comment

👍

battlmonstr merged commit d65aaa4 into master on Sep 23, 2024 (5 checks passed).
battlmonstr deleted the capi-execute-blocks-perpetual-buffer-fix branch on September 23, 2024 at 08:59.