
fix: resolve indexer infinite loop #82

Merged
merged 2 commits into kava/release/v0.26.x from rp-indexer-infinte-loop on Oct 22, 2024

Conversation

pirtleshell (Member) commented:

on very slow drives or when run with limited resources, a node can have a delay between the block being saved and the block_results being saved. if the block exists but the block_results do not, an infinite loop occurs: the indexer repeatedly requests the block and block_results until they both exist. because there is no delay between requests, the loop further constrains the node's resources and results in many calls for block_results before they are committed.

previously, if the indexer failed to find the block_results, it would bombard the node with requests for them without backing off. this commit updates the wait condition so that any error during indexing also triggers a wait. after waiting for either a new block or the timeout, the block_results are more likely to exist.
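in code terms, the change widens the wait condition in the indexing loop. the sketch below is only illustrative; the names (node, FetchAndIndex, WaitForNewBlockOrTimeout) are hypothetical stand-ins rather than the actual service types, but it shows the shape of the fix:

```go
package indexer

// illustrative sketch only: node, FetchAndIndex, and WaitForNewBlockOrTimeout
// are hypothetical stand-ins, not the actual indexer or client types. the
// point is the shape of the fix: an indexing error now takes the same wait
// path as being fully caught up.
type node interface {
	LatestBlockHeight() (int64, error)
	FetchAndIndex(from, to int64) error // fetch block + block_results and index them; fails if the results aren't committed yet
	WaitForNewBlockOrTimeout()          // block until a new block arrives or a timeout elapses
}

func run(n node, lastIndexed int64) {
	for {
		latest, err := n.LatestBlockHeight()

		if err == nil && latest > lastIndexed {
			err = n.FetchAndIndex(lastIndexed+1, latest)
			if err == nil {
				lastIndexed = latest
			}
		}

		// before the fix: only being caught up (latest <= lastIndexed) triggered
		// a wait, so a failed block_results lookup looped straight back into
		// another request. after the fix: any error also triggers the wait.
		if err != nil || latest <= lastIndexed {
			n.WaitForNewBlockOrTimeout()
		}
	}
}
```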

@pirtleshell (Member, Author) commented:

i discovered this bug when attempting to sync a node that had a very very slow drive (unwarmed & from a snapshot). after the chain service started, the node output 1000s of errors like

ERR failed to fetch block result err="could not find results for height #12268540" height=12268541 indexer=evm module=server

i first confirmed my understanding: stopping the node & starting it again, i saw block 12268540 sync, and then 1000s of the same error, but this time for block 12268541. this means that after a little time, the block results do manage to get committed. the problem is just that the indexer assumes the block results always exist if the block does. however, cometbft does not store the block and its results at the same time: it saves the new block from peers, recognizes it as a new height, and only then does the app process the block and pass back the results to be saved.

when the drive is slow (the save of block results has high latency) or the node has limited resources (the app's processing has high latency), there is a window of time in which the block is saved but the results are not.

before this commit, the evm indexer would inundate the node with requests that fail (further limiting the resources the app has to process the block & for cometbft to save the results). now, if an error occurs during indexing, the indexer will wait until either a new block is received (at which point the previous block's results are guaranteed to be saved) or a timeout elapses (1 minute).

the already-existing wait-for-new-blocks loop is used for the error condition
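for reference, that wait can be pictured as a select between the node's new-block signal and a one-minute timer. the channel and constant names here are hypothetical; only the behavior matches the description above:

```go
package indexer

import "time"

// hypothetical names again; only the shape matches the wait described above
const newBlockWaitTimeout = time.Minute

func waitForNewBlockOrTimeout(newBlockSignal <-chan struct{}) {
	select {
	case <-newBlockSignal:
		// a new block arrived, so the previous block's results are already committed
	case <-time.After(newBlockWaitTimeout):
		// bound the wait on a quiet chain, then let the loop re-check
	}
}
```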


i installed this on the box where i was experiencing the problem. instead of infinitely looping the erroring queries, it failed once, waited, and then continued successfully from that point forward.

@nddeluca (Member) left a comment:
Thank you for this fix! Could we upstream to main branch as well?

pirtleshell marked this pull request as ready for review October 22, 2024 20:03
pirtleshell merged commit e3cbae3 into kava/release/v0.26.x Oct 22, 2024
15 checks passed
pirtleshell deleted the rp-indexer-infinte-loop branch October 22, 2024 20:23
pirtleshell added a commit that referenced this pull request Oct 22, 2024
nddeluca pushed a commit that referenced this pull request Oct 23, 2024