Polygon first sync crash #14208
Comments
(Reading the docs): it is actually the Heimdall endpoint. In any case, I think the improvement suggestions still apply for better UX. |
I believe this was an explicit design decision: previously this was a silent error that would show up later as an execution failure. Failing early forces the user to check the status of their Heimdall endpoint and prevents those execution failures. |
I think the design decision is wrong then :) I pointed my instance to the hosted Polygon Heimdall instance, and apparently if anything fails, the default action is to shut down the node. See the following logs showing that after a network timeout it stopped Erigon. Even if it were a local Heimdall, it is an external component. Erigon, being a headless service, should not simply shut down when a dependency fails, but instead retry in the next sync loop (possibly with some delay between attempts).
|
It is not wrong :)
Erigon can't do much without a working Heimdall. If your Heimdall is down, you will get an execution root hash mismatch every 16 blocks (at every sprint start, due to missing bridge events), and that is unrecoverable. If we did what you are suggesting, we would end up with very long unwinds (due to root hash mismatches accumulating and new blocks coming in), and eventually Erigon would crash anyway because we have limited unwinds in E3.
There is no sync loop here; the err is outside of the sync loop. The only pragmatic thing that can be done here is to add a "transient err" (a concept we already have) to the bridge scraper for the 2nd error you ran into about The first err you reported |
Alright, yeah, I don't really have a solution then, so feel free to ignore. We just have to be aware that this is not behavior users would naturally expect from a headless service, and the first err was particularly annoying because it only shows up after the first OtterSync pass (start Erigon -> wait a few days -> Erigon stopped -> wat?), which is not a "fast feedback" loop. |
Yeah, fair, that is a good point. Maybe this Heimdall status check should be done before we start downloading snapshots (should be easy to move), and it should also have a more user-friendly error message, as you suggested. |
@wmitsuda but on your other point about making Erigon live without a functioning Heimdall: I agree in principle, but in reality it's very, very complex to achieve (I tried and gave up) because the two are quite intertwined. |
maybe someone else can think of a way to achieve it at some point |
fair 👍 |
Public endpoints Mainnet | https://heimdall-api.polygon.technology |
VBulikov added Imp2 as Importance |
On today's main, during the first sync, after finishing the download, it failed while accessing another endpoint and stopped.
I'm not sure what the http://localhost:1317/status endpoint is; at the very least I think there should be a more useful error message (which endpoint is that? which param configures it?).
I guess it is bor-related? In any case, I'm not sure the behavior should be to simply stop the node, e.g., what if the endpoint is temporarily down?