Try reconnecting to SHiP again when read fails. #614

yarkinwho · 2023-07-03T17:56:52Z

Some behavior may be subject to discussion:
1 Only retry if the read failed. That is, a failure to send requests or to generate a block will not trigger a retry.
Logic here: the majority of failures that are recoverable by reconnecting happen during reads, and the reconnect logic is simple there. Supporting reconnect during other steps would be much more complex, and the gain is marginal.

2 When reconnecting to SHiP, start from 250s before the canonical HEAD (if possible).
Logic here:
a This approach is simple. No moving parts at all.
b It is well tested that the code can start from a certain earlier block.
c The block 250s before the canonical HEAD must already be irreversible.

(250s may be too much; we can change that.)
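
The restart rule in point 2 can be sketched as a small pure function. This is a minimal sketch and not the code in this PR: `restart_block`, `blocks_per_second`, and `lowest_available` are hypothetical names, and the clamp to `lowest_available` is an assumption standing in for the "if possible" condition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: choose the block to resume from after a reconnect.
//   head              canonical HEAD block number
//   rewind_seconds    how far back to go (250 s in the proposal)
//   blocks_per_second assumed block rate of the chain being followed
//   lowest_available  assumed earliest block we are able to start from
inline std::uint64_t restart_block(std::uint64_t head,
                                   std::uint64_t rewind_seconds,
                                   std::uint64_t blocks_per_second,
                                   std::uint64_t lowest_available) {
    assert(blocks_per_second > 0);
    const std::uint64_t rewind_blocks = rewind_seconds * blocks_per_second;
    // Guard against unsigned underflow when the chain is shorter than the window.
    const std::uint64_t target = head > rewind_blocks ? head - rewind_blocks : 0;
    return std::max(target, lowest_available);
}
```

Keeping the computation separate from the reconnect path makes the rewind window easy to change and to unit test.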

heifner · 2023-07-03T18:18:52Z

cmd/ship_receiver_plugin.cpp

+            // Clearly this approach will introduce overhead, but since this piece of code
+            // will only execute during recovery, it should be fine.
+            if (head_header->number > 250) {
+               start_from -= 500;


It is trivial to keep track of LIB, which would be better than just guessing at a block number. See block.last_irreversible.block_num.

yarkinwho · 2023-07-04

That means we need a set of functions to query a single block, etc.

And there's a corner case:
FORK ------ A (origin HEAD)
| -LIB--------A' (new block at same height)

In the extreme case that LIB has already passed the fork block when we try to restart, LIB is not enough to recover.
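
The suggestion to keep track of LIB could look roughly like the sketch below. It is only an illustration: `received_block` and `lib_tracker` are hypothetical types, and `lib_num` is assumed to mirror the `block.last_irreversible.block_num` field mentioned in the review. It does not attempt to handle the fork corner case described above.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the fields of a received SHiP block that matter here.
struct received_block {
    std::uint32_t block_num;
    std::uint32_t lib_num;  // assumed to mirror block.last_irreversible.block_num
};

// Remembers the highest LIB seen so far so that a reconnect can use it
// as a candidate restart point instead of a guessed block number.
class lib_tracker {
public:
    void on_block(const received_block& b) {
        if (!lib_ || b.lib_num > *lib_) {
            lib_ = b.lib_num;
        }
    }

    // Empty until at least one block has been observed.
    std::optional<std::uint32_t> restart_from() const { return lib_; }

private:
    std::optional<std::uint32_t> lib_;
};
```

Because the tracker only records values already present in each received block, it would not by itself require extra queries to the node.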

heifner · 2023-07-03T18:22:38Z

cmd/ship_receiver_plugin.cpp

+               stream->binary(true);
+               stream->read_message_max(0x1ull << 36);
+               connect_stream();
+               initial_read();


I would refactor this and the code from init() into a common method.

heifner · 2023-07-03T18:22:55Z

cmd/ship_receiver_plugin.cpp

+               // Any other error will still result in exit.
+               SILK_INFO << "Trying to recover from SHiP read failure.";
+               // Wait for a while before doing anything in case we hit some network jam.
+               std::this_thread::sleep_for(std::chrono::seconds(3));


3 seconds seems like a rather long time.
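
One possible way to address this, offered only as a sketch, is to start with a short delay and grow it on repeated failures. The function name `reconnect_delay`, the 100 ms base, and the 3 s cap are all assumptions chosen for illustration; none of them come from the PR.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper: delay before reconnect attempt number `attempt`
// (0-based). The delay doubles per attempt and is capped at `cap`.
inline std::chrono::milliseconds reconnect_delay(
    unsigned attempt,
    std::chrono::milliseconds base = std::chrono::milliseconds{100},
    std::chrono::milliseconds cap = std::chrono::milliseconds{3000}) {
    using rep = std::chrono::milliseconds::rep;
    // Limit the shift so the multiplication cannot overflow for sane bases.
    const unsigned shift = std::min(attempt, 16u);
    const rep factor = rep{1} << shift;
    const std::chrono::milliseconds delay = base * factor;
    return std::min(delay, cap);
}
```

The recovery path could then call `std::this_thread::sleep_for(reconnect_delay(n))`, so a single transient failure costs about 100 ms while a persistent outage still backs off to the original 3 seconds.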

heifner · 2023-07-03T18:31:34Z

cmd/ship_receiver_plugin.cpp

-         abi = load_abi(eosio::json_token_stream{(char*)buff.data().data()});
+         auto end = buff.prepare(1);
+         ((char *)end.data())[0] = '\0';
+         buff.commit(1);


It seems rather odd that it would not always contain the terminating character.

yarkinwho · 2023-07-04

Yeah, I am not sure whether we should fix it like this or not.

I don't think the Boost WebSocket stream will write a zero there, so maybe we should fix json_token_stream.

But I am not sure it's a good idea to touch CDT for this issue.
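
An alternative to writing a terminator into the read buffer is to copy the payload into a `std::string`, whose `data()` and `c_str()` are guaranteed to be null-terminated since C++11. The sketch below uses a plain byte vector in place of the Beast buffer, and `to_terminated_string` is a hypothetical name. Boost.Beast's `buffers_to_string` performs a comparable copy from a buffer sequence. Whether the result can be passed to `json_token_stream` unchanged is an assumption based on the `char*` argument in the removed line.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: copy a received payload into an owning string.
// The string stores a '\0' after its last character, so s.data() can be
// handed to code that expects a C-style, null-terminated buffer.
inline std::string to_terminated_string(const std::vector<char>& payload) {
    return std::string(payload.begin(), payload.end());
}
```

This costs one extra copy of the ABI text, which only happens during the initial read.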

yarkinwho · 2023-07-04T03:15:17Z

After some discussion, we will make some major logic changes, so I am closing this PR.

yarkinwho added 2 commits July 3, 2023 15:05

Try reconnecting SHiP if read failed.

79f966d

Fix missing initial_read and issues in it

afbbd30

heifner requested changes Jul 3, 2023


arhag linked an issue Jul 4, 2023 that may be closed by this pull request

Handle forks properly on ship disconnect/reconnect #583

Closed

yarkinwho closed this Jul 4, 2023
