
EOS-EVM-Node: Does Not Retry After Ship Node Connection Issue #579

Closed
Johnaverse opened this issue Jun 8, 2023 · 3 comments · Fixed by eosnetworkfoundation/eos-evm-node#3
Labels: enhancement, 👍 lgtm

@Johnaverse (Contributor):

Issue: EOS-EVM-Node does not retry after the SHiP node connection drops; it exits instead

Here are the logs:

Jun 06 07:54:53 jungle4evm-arch41 silkworm[6699]:   WARN [06-06|07:54:53.967 UTC] Can't link new block #80'597'773 (id:04cdd30d04626f864de35171812ce9feb1b55f1c2bcbef171c03827e26633255,prev:>
Jun 06 07:54:53 jungle4evm-arch41 silkworm[6699]:   WARN [06-06|07:54:53.967 UTC] Fork at Block #80'597'772 (id:04cdd30c28aca266dc41714d9c0039ffa0ef0f8c0a4fac6796f6cde28d0b4c4f,prev:04cdd30>
Jun 06 07:54:53 jungle4evm-arch41 silkworm[6699]:   WARN [06-06|07:54:53.967 UTC] Removing forked native block #80'597'773 (id:04cdd30d7dfcfaa7243bb6911e7be5cb23067d5efc04d805928b94b4b38031>
Jun 06 07:54:53 jungle4evm-arch41 silkworm[6699]:   WARN [06-06|07:54:53.967 UTC] Reset upper bound for EVM Block #6'501'275, txs:0, hash:7deed7d547c8256c1ca6c83261345eb9bd4b4a87ae2465d6028>
Jun 06 07:57:58 jungle4evm-arch41 consul[6636]: 2023-06-06T07:57:58.418Z [INFO]  agent: Synced check: check=service:jungle4evm-silkworm
Jun 06 08:00:06 jungle4evm-arch41 silkworm[6699]:   CRIT [06-06|08:00:06.285 UTC] SHiP read failed : End of file
Jun 06 08:00:06 jungle4evm-arch41 silkworm[6699]:  ERROR [06-06|08:00:06.831 UTC] [2/10 BlockHashes]                 function=forward exception=kAborted
Jun 06 08:00:06 jungle4evm-arch41 silkworm[6699]:  ERROR [06-06|08:00:06.831 UTC] [2/10 BlockHashes]                 op=Forward returned=kAborted
Jun 06 08:00:06 jungle4evm-arch41 silkworm[6699]:  ERROR [06-06|08:00:06.831 UTC] SyncLoop                           function=work exception=kAborted

Description: I am facing an issue with the EOS-EVM-Node where it does not retry after encountering a SHiP node connection issue. It is reasonable to assume that a few connection issues will occur while the program is running. This behaviour poses a problem for the stability and reliability of the eos-evm-node.

Steps to Reproduce:

Start the EOS-EVM-Node.
Simulate a ship node connection issue by disconnecting the ship node from the network.
Observe the EOS-EVM-Node's behavior.
Expected Behavior: The EOS-EVM-Node should attempt to reconnect to the ship node after detecting the connection issue.

Actual Behavior: The EOS-EVM-Node does not make any retry attempts after encountering a ship node connection issue. It remains in a disconnected state, preventing the synchronization of data and disrupting the blockchain network's operations.

Impact: This issue significantly affects the availability and stability of the eos-evm-node and, consequently, EOS EVM operations. Without a proper retry mechanism, a failure to connect to the SHiP node causes the eos-evm-node to stop serving.

@taokayan (Contributor) commented Jun 8, 2023:

If eos-evm-node exits gracefully after SHiP is disconnected, one solution is to have a batch/Python script find the next available SHiP endpoint and restart eos-evm-node (see the sketch after the list below).

In a highly available setup, you can have:

  • leap(nodeos) node 1 with SHIP in VM 1
  • leap node 2 with SHIP in VM2
  • eos-evm-node 1 in VM3, managed by a script to automatically select the available leap node to connect
  • eos-evm-node 2 in VM4, managed by a script to automatically select the available leap node to connect
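
A minimal sketch of such a selection/restart script, assuming hypothetical SHiP endpoints, a placeholder binary path, and a placeholder `--ship-endpoint` flag (substitute your deployment's actual values and the node's real CLI options):

```python
#!/usr/bin/env python3
"""Watchdog sketch: pick a reachable SHiP endpoint and (re)start eos-evm-node."""
import socket
import subprocess
import time

# Hypothetical SHiP (state history) endpoints exposed by the leap/nodeos nodes.
SHIP_ENDPOINTS = [("vm1.example", 8080), ("vm2.example", 8080)]
EOS_EVM_NODE_CMD = ["/usr/local/bin/eos-evm-node"]  # placeholder path/flags
RETRY_DELAY_SECONDS = 10


def ship_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap TCP reachability probe; a real check might open the SHiP websocket."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_endpoint():
    """Return the first reachable SHiP endpoint, or None if all are down."""
    for host, port in SHIP_ENDPOINTS:
        if ship_is_reachable(host, port):
            return host, port
    return None


def main():
    while True:
        endpoint = pick_endpoint()
        if endpoint is None:
            time.sleep(RETRY_DELAY_SECONDS)
            continue
        host, port = endpoint
        # Flag name is a placeholder for however the node is pointed at SHiP.
        subprocess.run(EOS_EVM_NODE_CMD + [f"--ship-endpoint={host}:{port}"])
        # If the node exits (e.g. after a SHiP EOF), wait and pick an endpoint again.
        time.sleep(RETRY_DELAY_SECONDS)


if __name__ == "__main__":
    main()
```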

@yarkinwho (Contributor) commented:

After some discussion, the behavior should be:
1. The node retries the connection even if a reconnection attempt itself fails.
2. It retries a configurable number of times before exiting (setting the number to 0 effectively reproduces the current behavior).
3. The delay between retries is configurable, defaulting to 10 s.
4. The LIB (last irreversible block) is cached for each block, so that on reconnection there is a reliable LIB to restart from.
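
To make the intended behaviour concrete, here is an illustrative Python sketch (the node itself is C++; `connect_and_sync`, `max_retries`, and `retry_delay` are hypothetical names, not the actual implementation):

```python
import time


def run_sync(connect_and_sync, max_retries: int = 5, retry_delay: float = 10.0):
    """Sketch of the proposed retry policy (points 1-4 above).

    connect_and_sync stands in for the node's real SHiP connection/sync
    routine; it is assumed to raise ConnectionError when the SHiP link drops,
    to return only on a clean shutdown, and to update state["lib"] as blocks
    become irreversible.  max_retries=0 reproduces today's fail-fast behaviour.
    """
    state = {"lib": None}  # cached LIB, giving a reliable point to restart from
    failures = 0
    while True:
        try:
            connect_and_sync(state)  # resumes from state["lib"] on reconnection
            return  # clean shutdown, nothing more to do
        except ConnectionError:
            failures += 1
            if failures > max_retries:
                raise  # exhausted the configured retries: exit as the node does today
            time.sleep(retry_delay)  # configurable delay between retries, default 10 s
```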
