Polygon first sync crash #14208

Closed
wmitsuda opened this issue Mar 18, 2025 · 11 comments · Fixed by #14272

@wmitsuda
Member

Today's main, during the first sync: after finishing the download, it failed while accessing another endpoint and stopped.

I'm not sure what the http://localhost:1317/status endpoint is; at the very least I think there should be a more useful error message (which endpoint is that? which parameter configures it?).

I guess it is bor-related? In any case, I'm not sure the right behavior is to simply stop the node; what if the endpoint is temporarily down?

INFO[03-17|21:10:57.152] [snapshots:download] Stat                blocks=68.71M indices=68.71M alloc=30.2GB sys=34.7GB
INFO[03-17|21:10:57.188] [snapshots:history] Stat                 blocks=68.70M txs=5.11B txNum2blockNum="2048=49361K,3072=65048K,3200=67421K,3264=68562K,3272=68704K" first_history_idx_in_db=0 last_comitment_block=68704665 last_comitment_tx_num=5112500000 alloc=30.2GB sys=34.7GB
INFO[03-17|21:10:57.200] [1/1 OtterSync] DONE                     in=22h26m25.884762s block=68713999
INFO[03-17|21:10:57.210] Timings (slower than 50ms)               OtterSync=22h26m25.884s alloc=30.2GB sys=34.7GB
INFO[03-17|21:10:57.233] [bridge] running bridge service component lastFetchedEventId=3034760 lastProcessedEventId=3034760 lastProcessedBlockNum=68713984 lastProcessedBlockTime=0
INFO[03-17|21:10:57.235] [sync] replaying post initial blocks for bridge store to fill gap with execution start=68713985 end=68713999 blocks=15
INFO[03-17|21:10:57.235] [sync] running sync component
INFO[03-17|21:11:05.236] [sync] sync service component stopped
EROR[03-17|21:11:05.237] polygon sync crashed - stopping node     err="pos sync failed: Get \"http://localhost:1317/status\": dial tcp 127.0.0.1:1317: connect: connection refused"
INFO[03-17|21:11:05.237] polygon sync goroutine terminated
INFO[03-17|21:11:05.239] [snapshots] stopping downloader          files=7380
INFO[03-17|21:11:05.240] Exiting...
INFO[03-17|21:11:05.240] [snapshots] closing torrents
INFO[03-17|21:11:05.241] HTTP endpoint closed                     url=[::]:12345
INFO[03-17|21:11:05.242] RPC server shutting down
INFO[03-17|21:11:05.520] [txpool] stopped
INFO[03-17|21:11:05.521] devp2p txn pool goroutine terminated
INFO[03-17|21:11:06.727] [snapshots] closing db
INFO[03-17|21:11:06.727] [snapshots] downloader stopped
EROR[03-17|21:11:06.796] background component error               err="pos sync failed: Get \"http://localhost:1317/status\": dial tcp 127.0.0.1:1317: connect: connection refused"
@wmitsuda
Member Author

(Reading the docs): it is actually the Heimdall endpoint. In any case, I think the improvement suggestions still apply for better UX.

@shohamc1
Member

shohamc1 commented Mar 18, 2025

I believe this was an explicit design decision: previously this was a silent error that would surface later as an execution failure. Failing loudly forces the user to check the status of their Heimdall endpoint and prevents those execution failures.

@wmitsuda
Member Author

wmitsuda commented Mar 18, 2025

I think the design decision is wrong then :)

I pointed my instance to the hosted Polygon Heimdall instance, and apparently if anything fails, the default action is to shut down the node. The following logs show that Erigon stopped after a network timeout.

Even if it were a local Heimdall, it is an external component. Erigon, being a headless service, should not simply shut down when a dependency fails; instead it should retry in the next sync loop, possibly with some delay between attempts (see the sketch after the logs below).

INFO[03-18|15:51:33.188] [bridge] fetched new events periodic progress count=3 lastFetchedEventId=3040651 lastFetchedEventTime=2025-03-18T18:51:24Z
INFO[03-18|15:51:50.406] [4/6 Execution][agg] computing trie      progress=741.06k/1.21M alloc=19.1GB sys=45.6GB
INFO[03-18|15:52:10.407] [4/6 Execution][agg] computing trie      progress=761.51k/1.21M alloc=19.5GB sys=45.6GB
EROR[03-18|15:52:29.864] failed to update fork choice             latestValidHash=0x0000000000000000000000000000000000000000000000000000000000000000 err="context canceled"
EROR[03-18|15:52:29.865] [sync] waypoint execution err            lastCorrectTipNum=68824741 lastCorrectTipHash=0xe217c257b011c98c3bb500e276bc8421bf3d29fa89ce87ed57e12c84b3c99423 execErr="context canceled"
INFO[03-18|15:52:30.406] [4/6 Execution][agg] computing trie      progress=785.71k/1.21M alloc=19.9GB sys=45.6GB
INFO[03-18|15:52:32.866] [sync] sync service component stopped
EROR[03-18|15:52:32.866] polygon sync crashed - stopping node     err="pos sync bridge failed: Get \"https://heimdall-api.polygon.technology/clerk/event-record/list?from-id=3040652&to-time=1742323907&limit=50\": read tcp [2804:7f0:b403:fe7d:d9b2:5354:5ca4:7920]:64381->[2606:4700:4400::ac40:9292]:443: read: operation timed out"
INFO[03-18|15:52:32.866] polygon sync goroutine terminated
INFO[03-18|15:52:32.867] [snapshots] stopping downloader          files=7380
INFO[03-18|15:52:32.867] [snapshots] closing torrents
INFO[03-18|15:52:32.867] Exiting...
INFO[03-18|15:52:32.871] [txpool] stopped
INFO[03-18|15:52:32.871] devp2p txn pool goroutine terminated
INFO[03-18|15:52:32.871] HTTP endpoint closed                     url=[::]:12345
INFO[03-18|15:52:32.871] RPC server shutting down
INFO[03-18|15:52:32.881] [4/6 Execution] Done                     blk=68829828 blks=5088 blk/s=6.8 txs=403253 tx/s=539 gas/s=87.57M buf=3.7GB/512.0MB stepsInDB=0.00 step=3278.9 inMem=false alloc=20.1GB sys=45.6GB
WARN[03-18|15:52:32.881] Cannot update chain head                 hash=0xe9a372a8bd81939899853d55792b2de413a71e51bf8bb1be467732460dcf1036 err="updateForkChoice: [4/6 Execution] hash sort failed: loadIntoTable : stopped"
INFO[03-18|15:52:33.646] [snapshots] closing db
INFO[03-18|15:52:33.646] [snapshots] downloader stopped
EROR[03-18|15:52:35.253] background component error               err="pos sync bridge failed: Get \"https://heimdall-api.polygon.technology/clerk/event-record/list?from-id=3040652&to-time=1742323907&limit=50\": read tcp [2804:7f0:b403:fe7d:d9b2:5354:5ca4:7920]:64381->[2606:4700:4400::ac40:9292]:443: read: operation timed out"
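
To make the suggestion concrete: a minimal sketch of retrying a failed Heimdall request with a delay between attempts instead of crashing the node. The function and its limits are hypothetical, purely illustrative, and not Erigon's actual bridge code.

```go
package bridge

import (
	"context"
	"fmt"
	"log"
	"time"
)

// fetchWithRetry retries a Heimdall request with an exponentially growing
// delay instead of treating the first network error as fatal. maxAttempts
// and the backoff schedule are illustrative choices.
func fetchWithRetry(ctx context.Context, fetch func(context.Context) error) error {
	const maxAttempts = 5
	delay := time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fetch(ctx); err == nil {
			return nil
		}
		log.Printf("heimdall request failed (attempt %d/%d), retrying in %s: %v",
			attempt, maxAttempts, delay, err)
		select {
		case <-ctx.Done():
			return ctx.Err() // node is shutting down, stop retrying
		case <-time.After(delay):
		}
		delay *= 2 // back off between attempts
	}
	return fmt.Errorf("heimdall still unreachable after %d attempts: %w", maxAttempts, err)
}
```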

@taratorio
Member

taratorio commented Mar 19, 2025

It is not wrong :)

Erigon being a headless service should not simply shut down when a dependency fails

Erigon can't do much without a working Heimdall. If your Heimdall is down you will get an execution root hash mismatch every 16 blocks (at every sprint start, due to missing bridge events), and that is unrecoverable. If we did what you are suggesting we would end up with very long unwinds (root hash mismatches accumulating while new blocks keep coming in), and eventually Erigon would crash anyway because unwinds are limited in E3.

instead retry it in the next sync loop

There is no sync loop here; the error happens outside of the sync loop.

The only pragmatic thing that can be done here is to add a "transient err" (a concept we already have) to the bridge scraper for the second error you ran into, the read: operation timed out on Get "https://heimdall-api.polygon.technology/clerk/event-record. It looks like Shoham has done this already in #14220.
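
As a rough illustration of the transient-error idea (hypothetical names; see #14220 for the real change), network timeouts from the Heimdall client could be tagged as retryable while everything else stays fatal:

```go
package heimdall

import (
	"errors"
	"net"
)

// ErrTransient marks errors the bridge scraper should retry rather than
// crash on. Both the sentinel and classify are illustrative, not Erigon's API.
var ErrTransient = errors.New("transient heimdall error")

// classify wraps network timeouts (e.g. "read: operation timed out" on the
// clerk/event-record/list request) so callers can test errors.Is(err, ErrTransient).
func classify(err error) error {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return errors.Join(ErrTransient, err)
	}
	return err // anything else still bubbles up and stops the node
}
```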

The first error you reported, Get "http://localhost:1317/status": dial tcp 127.0.0.1:1317: connect: connection refused, is because you didn't have a local Heimdall instance running and/or you forgot to specify --bor.heimdall <heimdall url> (it defaults to localhost:1317) to point Erigon at a remote one. This is really a user error. It can probably be improved by adding a few extra words to the error message, e.g. "check that your local heimdall is running or specify a healthy heimdall instance via the --bor.heimdall flag".

@wmitsuda
Member Author

Alright, yeah, I don't really have a solution then, so feel free to ignore.

We just have to be aware that this is not behavior users naturally expect from a headless service. The first error was particularly annoying because it only surfaced after the first OtterSync pass (start Erigon -> wait a few days -> Erigon stopped -> wat?), not in a "fast feedback" loop.

@taratorio
Member

Yeah, fair, that is a good point. Maybe this Heimdall status check should be done before we start downloading snapshots (it should be easy to move), and it should also have a more user-friendly error message, as you suggested.
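
A sketch of what that could look like, assuming a hypothetical startup helper (this is not Erigon's actual startup wiring): probe /status with a short timeout before committing to the snapshot download, and include the --bor.heimdall hint in the error:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// checkHeimdall probes the /status endpoint once, early in startup, so a
// misconfigured endpoint fails within seconds instead of after days of
// OtterSync.
func checkHeimdall(ctx context.Context, baseURL string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/status", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("heimdall unreachable at %s: %w; check that a local heimdall "+
			"is running or point Erigon at a healthy instance via --bor.heimdall", baseURL, err)
	}
	return resp.Body.Close()
}

func main() {
	if err := checkHeimdall(context.Background(), "http://localhost:1317"); err != nil {
		fmt.Println("aborting startup:", err)
		return
	}
	// ...only now start the snapshot download and the sync service.
}
```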

@taratorio
Member

@wmitsuda, on your other point about making Erigon live without a functioning Heimdall: I agree in principle, but in reality it's very complex to achieve (I've tried and gave up) because the two are quite intertwined.

@taratorio
Member

Maybe someone else can think of a way to achieve it at some point.

@wmitsuda
Member Author

fair 👍

@AlexeyAkhunov
Contributor

Public endpoints

Mainnet | https://heimdall-api.polygon.technology
Amoy Testnet | https://heimdall-api-amoy.polygon.technology

AskAlexSharov pushed a commit that referenced this issue Mar 21, 2025

VBulikov added the UX label Mar 21, 2025

VBulikov added Imp2 as Importance
