Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try reconnecting to SHiP again when read fails. #614

Closed
wants to merge 2 commits into from

Conversation

yarkinwho
Copy link
Contributor

Fix #579 #583

Some behavior may subject to discussion:
1 Only retry if failed to read. that is, failed to send requests or failed to generate block will not trigger retry.
Logic here: majority of the recoverable (by reconnect) fails happen here and the reconnect logic is simple there. Supporting reconnect during other steps would be much more complex and the gain is marginal.

2 When reconnect to SHiP, start from 250s before canonical HEAD (if possible)
Logic here:
a This approach is simple. No moving parts at all.
b It is well tested that the code can start from a certain early block.
c The block 250s before canonical HEAD must have been irreversible already.

(250s maybe too much, we can change that)

// Clearly this approach will introduce overhaad, but since this piece of code
// will only execute during recovery, it should be fine.
if (head_header->number > 250) {
start_from -= 500;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is trivial to keep track of LIB which would be better than just guessing at a block number. See block.last_irreversible.block_num.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means we need a set of functions to query single block etc.

And there's a corner case:
FORK ------ A (origin HEAD)
| -LIB--------A' (new block at same height)

In the extreme case that LIB already pass the fork block when we try to restart, LIB is not enough to recover.

stream->binary(true);
stream->read_message_max(0x1ull << 36);
connect_stream();
initial_read();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would refactor this and the code from init() into a common method.

// Any other error will still result in exit.
SILK_INFO << "Trying to recover from SHiP read failure.";
// Wait for a while before doing anything in case we hit some network jam.
std::this_thread::sleep_for(std::chrono::seconds(3));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 seconds seems like a rather long time.

abi = load_abi(eosio::json_token_stream{(char*)buff.data().data()});
auto end = buff.prepare(1);
((char *)end.data())[0] = '\0';
buff.commit(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems rather odd that it would not always contain the terminating character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am not sure whether we should fix it like this or not.

I don't think boost websocket scream will write a zero there. So maybe we should fix json_token_stream.

But I am not sure if it's a good idea to touch cdt for this issue..

@arhag arhag linked an issue Jul 4, 2023 that may be closed by this pull request
@yarkinwho
Copy link
Contributor Author

After some discussion, we will have some major logic changes, so close this PR.

@yarkinwho yarkinwho closed this Jul 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants