Polygon first sync crash #14208

Closed
wmitsuda opened this issue Mar 18, 2025 · 11 comments · Fixed by #14272

@wmitsuda
Member

Today's main, during the first sync: after finishing the download, it failed while accessing another endpoint and stopped.

I'm not sure what the http://localhost:1317/status endpoint is; at the very least I think there should be a more useful error message (which endpoint is that? which parameter configures it?).

I guess it is bor-related? In any case, I'm not sure the right behavior is to simply stop the node; what if the endpoint is temporarily down?

INFO[03-17|21:10:57.152] [snapshots:download] Stat                blocks=68.71M indices=68.71M alloc=30.2GB sys=34.7GB
INFO[03-17|21:10:57.188] [snapshots:history] Stat                 blocks=68.70M txs=5.11B txNum2blockNum="2048=49361K,3072=65048K,3200=67421K,3264=68562K,3272=68704K" first_history_idx_in_db=0 last_comitment_block=68704665 last_comitment_tx_num=5112500000 alloc=30.2GB sys=34.7GB
INFO[03-17|21:10:57.200] [1/1 OtterSync] DONE                     in=22h26m25.884762s block=68713999
INFO[03-17|21:10:57.210] Timings (slower than 50ms)               OtterSync=22h26m25.884s alloc=30.2GB sys=34.7GB
INFO[03-17|21:10:57.233] [bridge] running bridge service component lastFetchedEventId=3034760 lastProcessedEventId=3034760 lastProcessedBlockNum=68713984 lastProcessedBlockTime=0
INFO[03-17|21:10:57.235] [sync] replaying post initial blocks for bridge store to fill gap with execution start=68713985 end=68713999 blocks=15
INFO[03-17|21:10:57.235] [sync] running sync component
INFO[03-17|21:11:05.236] [sync] sync service component stopped
EROR[03-17|21:11:05.237] polygon sync crashed - stopping node     err="pos sync failed: Get \"http://localhost:1317/status\": dial tcp 127.0.0.1:1317: connect: connection refused"
INFO[03-17|21:11:05.237] polygon sync goroutine terminated
INFO[03-17|21:11:05.239] [snapshots] stopping downloader          files=7380
INFO[03-17|21:11:05.240] Exiting...
INFO[03-17|21:11:05.240] [snapshots] closing torrents
INFO[03-17|21:11:05.241] HTTP endpoint closed                     url=[::]:12345
INFO[03-17|21:11:05.242] RPC server shutting down
INFO[03-17|21:11:05.520] [txpool] stopped
INFO[03-17|21:11:05.521] devp2p txn pool goroutine terminated
INFO[03-17|21:11:06.727] [snapshots] closing db
INFO[03-17|21:11:06.727] [snapshots] downloader stopped
EROR[03-17|21:11:06.796] background component error               err="pos sync failed: Get \"http://localhost:1317/status\": dial tcp 127.0.0.1:1317: connect: connection refused"
@wmitsuda
Member Author

(Reading the docs): it is actually the Heimdall endpoint. In any case, I think the improvement suggestions still apply for better UX.

@shohamc1
Member

shohamc1 commented Mar 18, 2025

I believe this was an explicit design decision: previously this was a silent error that would surface later as an execution failure. Failing loudly forces the user to check the status of their Heimdall endpoint and prevents those execution failures.

@wmitsuda
Member Author

wmitsuda commented Mar 18, 2025

I think the design decision is wrong then :)

I pointed my instance to the hosted Polygon Heimdall instance, and apparently if anything fails, the default action is to shut down the node. The following logs show that Erigon stopped after a network timeout.

Even if it were a local Heimdall, it is an external component. Erigon, being a headless service, should not simply shut down when a dependency fails; instead it should retry in the next sync loop, possibly with some delay between attempts (see the sketch after the logs below).

INFO[03-18|15:51:33.188] [bridge] fetched new events periodic progress count=3 lastFetchedEventId=3040651 lastFetchedEventTime=2025-03-18T18:51:24Z
INFO[03-18|15:51:50.406] [4/6 Execution][agg] computing trie      progress=741.06k/1.21M alloc=19.1GB sys=45.6GB
INFO[03-18|15:52:10.407] [4/6 Execution][agg] computing trie      progress=761.51k/1.21M alloc=19.5GB sys=45.6GB
EROR[03-18|15:52:29.864] failed to update fork choice             latestValidHash=0x0000000000000000000000000000000000000000000000000000000000000000 err="context canceled"
EROR[03-18|15:52:29.865] [sync] waypoint execution err            lastCorrectTipNum=68824741 lastCorrectTipHash=0xe217c257b011c98c3bb500e276bc8421bf3d29fa89ce87ed57e12c84b3c99423 execErr="context canceled"
INFO[03-18|15:52:30.406] [4/6 Execution][agg] computing trie      progress=785.71k/1.21M alloc=19.9GB sys=45.6GB
INFO[03-18|15:52:32.866] [sync] sync service component stopped
EROR[03-18|15:52:32.866] polygon sync crashed - stopping node     err="pos sync bridge failed: Get \"https://heimdall-api.polygon.technology/clerk/event-record/list?from-id=3040652&to-time=1742323907&limit=50\": read tcp [2804:7f0:b403:fe7d:d9b2:5354:5ca4:7920]:64381->[2606:4700:4400::ac40:9292]:443: read: operation timed out"
INFO[03-18|15:52:32.866] polygon sync goroutine terminated
INFO[03-18|15:52:32.867] [snapshots] stopping downloader          files=7380
INFO[03-18|15:52:32.867] [snapshots] closing torrents
INFO[03-18|15:52:32.867] Exiting...
INFO[03-18|15:52:32.871] [txpool] stopped
INFO[03-18|15:52:32.871] devp2p txn pool goroutine terminated
INFO[03-18|15:52:32.871] HTTP endpoint closed                     url=[::]:12345
INFO[03-18|15:52:32.871] RPC server shutting down
INFO[03-18|15:52:32.881] [4/6 Execution] Done                     blk=68829828 blks=5088 blk/s=6.8 txs=403253 tx/s=539 gas/s=87.57M buf=3.7GB/512.0MB stepsInDB=0.00 step=3278.9 inMem=false alloc=20.1GB sys=45.6GB
WARN[03-18|15:52:32.881] Cannot update chain head                 hash=0xe9a372a8bd81939899853d55792b2de413a71e51bf8bb1be467732460dcf1036 err="updateForkChoice: [4/6 Execution] hash sort failed: loadIntoTable : stopped"
INFO[03-18|15:52:33.646] [snapshots] closing db
INFO[03-18|15:52:33.646] [snapshots] downloader stopped
EROR[03-18|15:52:35.253] background component error               err="pos sync bridge failed: Get \"https://heimdall-api.polygon.technology/clerk/event-record/list?from-id=3040652&to-time=1742323907&limit=50\": read tcp [2804:7f0:b403:fe7d:d9b2:5354:5ca4:7920]:64381->[2606:4700:4400::ac40:9292]:443: read: operation timed out"
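
To make the suggestion concrete: a minimal sketch of retrying a failed Heimdall request with a delay between attempts instead of crashing the node. The function and its limits are hypothetical, purely illustrative, and not Erigon's actual bridge code.

```go
package bridge

import (
	"context"
	"fmt"
	"log"
	"time"
)

// fetchWithRetry retries a Heimdall request with an exponentially growing
// delay instead of treating the first network error as fatal. maxAttempts
// and the backoff schedule are illustrative choices.
func fetchWithRetry(ctx context.Context, fetch func(context.Context) error) error {
	const maxAttempts = 5
	delay := time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fetch(ctx); err == nil {
			return nil
		}
		log.Printf("heimdall request failed (attempt %d/%d), retrying in %s: %v",
			attempt, maxAttempts, delay, err)
		select {
		case <-ctx.Done():
			return ctx.Err() // node is shutting down, stop retrying
		case <-time.After(delay):
		}
		delay *= 2 // back off between attempts
	}
	return fmt.Errorf("heimdall still unreachable after %d attempts: %w", maxAttempts, err)
}
```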

@taratorio
Member

taratorio commented Mar 19, 2025

It is not wrong :)

Erigon being a headless service should not simply shut down when a dependency fails

Erigon can't do much without a working Heimdall. If your Heimdall is down you will get an execution root hash mismatch every 16 blocks (at every sprint start, due to missing bridge events), and that is unrecoverable. If we did what you are suggesting we would end up with very long unwinds (root hash mismatches accumulating while new blocks keep coming in), and eventually Erigon would crash anyway because unwinds are limited in E3.

instead retry it in the next sync loop

There is no sync loop here; the error happens outside of the sync loop.

The only pragmatic thing that can be done here is to add a "transient err" (a concept we already have) to the bridge scraper for the second error you ran into, the read: operation timed out on Get "https://heimdall-api.polygon.technology/clerk/event-record. It looks like Shoham has done this already in #14220.
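
As a rough illustration of the transient-error idea (hypothetical names; see #14220 for the real change), network timeouts from the Heimdall client could be tagged as retryable while everything else stays fatal:

```go
package heimdall

import (
	"errors"
	"net"
)

// ErrTransient marks errors the bridge scraper should retry rather than
// crash on. Both the sentinel and classify are illustrative, not Erigon's API.
var ErrTransient = errors.New("transient heimdall error")

// classify wraps network timeouts (e.g. "read: operation timed out" on the
// clerk/event-record/list request) so callers can test errors.Is(err, ErrTransient).
func classify(err error) error {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return errors.Join(ErrTransient, err)
	}
	return err // anything else still bubbles up and stops the node
}
```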

The first error you reported, Get "http://localhost:1317/status": dial tcp 127.0.0.1:1317: connect: connection refused, is because you didn't have a local Heimdall instance running and/or you forgot to specify --bor.heimdall <heimdall url> (it defaults to localhost:1317) to point Erigon at a remote one. This is really a user error. It can probably be improved by adding a few extra words to the error message, e.g. "check that your local heimdall is running or specify a healthy heimdall instance via the --bor.heimdall flag".

@wmitsuda
Member Author

Alright, yeah, I don't really have a solution then, so feel free to ignore.

We just have to be aware that this is not behavior users naturally expect from a headless service. The first error was particularly annoying because it only surfaced after the first OtterSync pass (start Erigon -> wait a few days -> Erigon stopped -> wat?), not in a "fast feedback" loop.

@taratorio
Member

Yeah, fair, that is a good point. Maybe this Heimdall status check should be done before we start downloading snapshots (it should be easy to move), and it should also have a more user-friendly error message, as you suggested.
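
A sketch of what that could look like, assuming a hypothetical startup helper (this is not Erigon's actual startup wiring): probe /status with a short timeout before committing to the snapshot download, and include the --bor.heimdall hint in the error:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// checkHeimdall probes the /status endpoint once, early in startup, so a
// misconfigured endpoint fails within seconds instead of after days of
// OtterSync.
func checkHeimdall(ctx context.Context, baseURL string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/status", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("heimdall unreachable at %s: %w; check that a local heimdall "+
			"is running or point Erigon at a healthy instance via --bor.heimdall", baseURL, err)
	}
	return resp.Body.Close()
}

func main() {
	if err := checkHeimdall(context.Background(), "http://localhost:1317"); err != nil {
		fmt.Println("aborting startup:", err)
		return
	}
	// ...only now start the snapshot download and the sync service.
}
```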

@taratorio
Member

@wmitsuda, on your other point about making Erigon live without a functioning Heimdall: I agree in principle, but in reality it's very complex to achieve (I've tried and gave up) because the two are quite intertwined.

@taratorio
Member

Maybe someone else can think of a way to achieve it at some point.

@wmitsuda
Member Author

fair 👍

@AlexeyAkhunov
Contributor

Public endpoints

Mainnet | https://heimdall-api.polygon.technology
Amoy Testnet | https://heimdall-api-amoy.polygon.technology

AskAlexSharov pushed a commit that referenced this issue Mar 21, 2025

VBulikov added the UX label Mar 21, 2025

VBulikov added Imp2 as Importance
