Bug: New Nodes Cannot Sync #23

Closed · Tracked by #606
brennanjl opened this issue Sep 16, 2024 · 23 comments · Fixed by #29

brennanjl commented Sep 16, 2024

The bug that caused the halt a couple of days ago (mentioned here #17) appears not to be fully resolved. While node operators that were online at the time of the halt have been able to proceed as usual, new nodes syncing the network halt at the block at which the error occurred. This has been noted by two separate joiners (see here #20) and has been reproduced by the Kwil team.

The Problem

The bug that is causing new nodes to fail to sync, while related to the previous halt, is a different bug. The previous halt seemed to cause CometBFT (the consensus engine that Kwil uses) to end up in an unexpected state while processing the byzantine validator punishment. The state is stored and managed by CometBFT, and thus isn't directly modifiable even by the Kwil team. We are working with the CometBFT team to further diagnose this issue and create a fix. In the meantime, we have created a fix that will allow node operators to join the network.

Immediate Fix: State Sync

The quickest and easiest way to fix this issue is to use statesync to allow new nodes to sync from snapshots. This is a band-aid fix, but it should allow operators to onboard as normal.

In order for this to happen, new nodes will have to sync from nodes that are creating state snapshots. Kwil's node is already doing this. Since most new nodes use TSN's nodes as their bootstrap nodes, we would also recommend that the TSN team have their validators do this (CC: @outerlook @MicBun). It is not necessary for other validators to configure this, but they are free to. To run kwild in snapshot mode, please refer to the docs. It shouldn't take more than a couple of minutes to configure.
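For reference, a minimal [app.snapshots] section in config.toml looks like the working example shared later in this thread (values are illustrative; the docs are authoritative):

[app.snapshots]
enabled = true                  # turn snapshot creation on
snapshot_dir = "snapshots"      # relative to the root dir
recurring_height = 600          # take a snapshot every 600 blocks
max_snapshots = 4               # keep at most 4 snapshots on disk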

New nodes joining the network will also need to run the new v1.1.9 release. New node operators should configure their nodes to sync from snapshots in order to connect properly. All new nodes should now use the following setup when running for the first time:

Step 1: Ensure psql is installed

Nodes should have the psql command line utility installed.
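For example, on Debian/Ubuntu (package names differ on other distros):

sudo apt-get install -y postgresql-client
psql --version   # confirms psql is installed and on PATH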

Step 2: Wipe All Application State

If you were previously running a node that failed to sync, make sure you delete all of your data. This means deleting all of your data from Postgres, as well as from ~/.kwild/abci and rcvdSnaps (some of these might not exist depending on your prior configuration, in which case that is ok).
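A minimal sketch of the wipe, assuming the default kwild/Postgres setup seen in the logs in this thread (user kwild, database kwild on 127.0.0.1:5432); adjust credentials and paths to your deployment:

# Drop and recreate the application database:
psql --username kwild --host 127.0.0.1 --port 5432 --dbname postgres \
  --command 'DROP DATABASE IF EXISTS kwild;' \
  --command 'CREATE DATABASE kwild OWNER kwild;'
# Remove local chain state (some of these may not exist):
rm -rf ~/.kwild/abci rcvdSnaps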

Step 3: Run the node with the statesync config flags

kwild --root-dir ./my-peer-config/ --chain.statesync.enable=true --chain.statesync.rpc-servers='http://3.92.83.167:26657'

@outerlook @MicBun we may want to update TSN's docs for this. The above uses Kwil's node for syncing snapshots, but once you have TSN's nodes running this, you can replace it with your own endpoints if you want.

Our team will follow up with a plan for a longer term fix to re-enable blocksync once we have solved the root CometBFT issue.

EDIT: I added that you need to wipe all of your local state and have psql installed. These were steps I overlooked that many ran into in Discord.

MicBun (Contributor) commented Sep 17, 2024

The above uses Kwil's node for syncing snapshots, but once you have TSN's nodes running this, you can replace it with your own endpoints if you want.

Alright, will replace the statesync.rpc-servers with our nodes once it is available. Thank you @brennanjl for the simplified guidance.

azgms (Contributor) commented Sep 17, 2024

Hi guys, can you provide binary builds for this release?

MicBun (Contributor) commented Sep 17, 2024

Hi @azgms, you can get the binary from our release page here:
https://github.com/truflation/tsn/releases/tag/v1.1.9

azgms (Contributor) commented Sep 17, 2024

Thanks @MicBun!

MicBun (Contributor) commented Sep 18, 2024

Step 3: Run the node with the statesync config flags

Hi @brennanjl, I tried to use a snapshot from Kwil's endpoint with the CLI command you provided; however, the snapshot was rejected with context deadline exceeded. Is there any further configuration needed here?

{"level":"info","ts":1726639298.4200094,"logger":"kwild.cometbft","caller":"statesync/syncer.go:119","msg":"Discovered new snapshot","module":"statesync","height":441000,"format":0,"hash":"35675015C0CF8697AD630B960BBC3FA5753AE2FD3AFD969A51F3DFC95E737194"}
{"level":"info","ts":1726639298.421205,"logger":"kwild.cometbft","caller":"statesync/syncer.go:119","msg":"Discovered new snapshot","module":"statesync","height":440400,"format":0,"hash":"D3DB9111BBB2A3B29F36D66D5DCB2F80C436C3DCE9D874F20C8C0053207ECE55"}
{"level":"info","ts":1726639298.4212537,"logger":"kwild.cometbft","caller":"statesync/syncer.go:119","msg":"Discovered new snapshot","module":"statesync","height":439800,"format":0,"hash":"3D5A816239713E41DC27589E7B3FF8879F0747B6DDBDFEEA99C623E0D0D24552"}
{"level":"info","ts":1726639298.4212916,"logger":"kwild.cometbft","caller":"statesync/syncer.go:119","msg":"Discovered new snapshot","module":"statesync","height":439200,"format":0,"hash":"B2172F747BFFF1DCFCEB6D8CB006935194956EAC03EFFA35C5A65A676120371F"}
{"level":"info","ts":1726639306.8461366,"logger":"kwild.cometbft","caller":"light/client.go:559","msg":"VerifyHeader","module":"statesync","module":"light","height":441001,"hash":"57D589D77A9E81D152F763C0C6CD8C3EBC59E0B7B3CBCC5180D0C2B063C2B1CB"}
{"level":"info","ts":1726639308.4680336,"logger":"kwild.cometbft","caller":"light/client.go:559","msg":"VerifyHeader","module":"statesync","module":"light","height":441002,"hash":"E1909F617A42F34DF41BAB9B8BBEA37359544CC67D03B2F51F7786C20DD4617B"}
{"level":"info","ts":1726639309.1236897,"logger":"kwild.cometbft","caller":"statesync/syncer.go:323","msg":"Offering snapshot to ABCI app","module":"statesync","height":441000,"format":0,"hash":"35675015C0CF8697AD630B960BBC3FA5753AE2FD3AFD969A51F3DFC95E737194"}
{"level":"info","ts":1726639309.1237757,"logger":"kwild.state-syncer","caller":"statesync/statesync.go:93","msg":"Offering snapshot","height":441000,"format":0,"App Hash":"35675015c0cf8697ad630b960bbc3fa5753ae2fd3afd969a51f3dfc95e737194"}
{"level":"info","ts":1726639309.7724137,"logger":"kwild.cometbft","caller":"statesync/syncer.go:340","msg":"Snapshot accepted, restoring","module":"statesync","height":441000,"format":0,"hash":"35675015C0CF8697AD630B960BBC3FA5753AE2FD3AFD969A51F3DFC95E737194"}
{"level":"info","ts":1726639309.7725916,"logger":"kwild.cometbft","caller":"light/client.go:482","msg":"Header has already been verified","module":"statesync","module":"light","height":441000,"hash":"7630B0C001570E0AE8DCEFB2C6FB461E63132ED9A50FCF4F3C84CCD4A6B88816"}
{"level":"info","ts":1726639309.772599,"logger":"kwild.cometbft","caller":"statesync/syncer.go:441","msg":"Fetching snapshot chunk","module":"statesync","height":441000,"format":0,"chunk":0,"total":1}
{"level":"info","ts":1726639309.7730281,"logger":"kwild.cometbft","caller":"light/client.go:482","msg":"Header has already been verified","module":"statesync","module":"light","height":441001,"hash":"57D589D77A9E81D152F763C0C6CD8C3EBC59E0B7B3CBCC5180D0C2B063C2B1CB"}
{"level":"info","ts":1726639309.773077,"logger":"kwild.cometbft","caller":"light/client.go:482","msg":"Header has already been verified","module":"statesync","module":"light","height":441002,"hash":"E1909F617A42F34DF41BAB9B8BBEA37359544CC67D03B2F51F7786C20DD4617B"}
{"level":"info","ts":1726639310.2928545,"logger":"kwild.cometbft","caller":"light/client.go:482","msg":"Header has already been verified","module":"statesync","module":"light","height":441001,"hash":"57D589D77A9E81D152F763C0C6CD8C3EBC59E0B7B3CBCC5180D0C2B063C2B1CB"}
{"level":"info","ts":1726639310.292965,"logger":"kwild.cometbft","caller":"light/client.go:482","msg":"Header has already been verified","module":"statesync","module":"light","height":441000,"hash":"7630B0C001570E0AE8DCEFB2C6FB461E63132ED9A50FCF4F3C84CCD4A6B88816"}
{"level":"info","ts":1726639313.655442,"logger":"kwild.state-syncer","caller":"statesync/statesync.go:144","msg":"Applied snapshot chunk","height":441000,"index":0}
{"level":"info","ts":1726639313.6555147,"logger":"kwild.state-syncer","caller":"statesync/statesync.go:151","msg":"All chunks received - Starting DB restore process"}
{"level":"info","ts":1726639313.6720688,"logger":"kwild.state-syncer","caller":"statesync/statesync.go:205","msg":"Restore DB: ","command":"psql --username kwild --host 127.0.0.1 --port 5432 --dbname kwild --no-password"}
{"level":"error","ts":1726639313.6736848,"logger":"kwild.abci","caller":"abci/abci.go:614","msg":"Failed to apply snapshot chunk: exec: \"psql\": executable file not found in $PATH\nreject snapshot","stacktrace":"github.com/kwilteam/kwil-db/internal/abci.(*AbciApp).ApplySnapshotChunk\n\t/home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/internal/abci/abci.go:614\ngithub.com/cometbft/cometbft/abci/client.(*localClient).ApplySnapshotChunk\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/abci/client/local_client.go:150\ngithub.com/cometbft/cometbft/proxy.(*appConnSnapshot).ApplySnapshotChunk\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/proxy/app_conn.go:220\ngithub.com/cometbft/cometbft/statesync.(*syncer).applyChunks\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:367\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:303\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"info","ts":1726639313.6737542,"logger":"kwild.cometbft","caller":"statesync/syncer.go:375","msg":"Applied snapshot chunk to ABCI app","module":"statesync","height":441000,"format":0,"chunk":0,"total":1}
{"level":"info","ts":1726639313.6737986,"logger":"kwild.cometbft","caller":"statesync/syncer.go:206","msg":"Snapshot rejected","module":"statesync","height":441000,"format":0,"hash":"35675015C0CF8697AD630B960BBC3FA5753AE2FD3AFD969A51F3DFC95E737194"}
{"level":"info","ts":1726639314.3310657,"logger":"kwild.cometbft","caller":"light/client.go:559","msg":"VerifyHeader","module":"statesync","module":"light","height":440401,"hash":"4552EF674FE7E5B58C66348B18B2B6311C34E562BD03AB8341CFB340BDC0CD81"}
{"level":"info","ts":1726639327.1242297,"logger":"kwild.cometbft","caller":"pex/pex_reactor.go:457","msg":"Ensure peers","module":"pex","numOutPeers":6,"numInPeers":0,"numDialing":0,"numToDial":4}
{"level":"info","ts":1726639327.1243608,"logger":"kwild.cometbft","caller":"pex/pex_reactor.go:522","msg":"We need more addresses. Sending pexRequest to random peer","module":"pex","peer":"Peer{MConn{18.189.163.27:26656} c6d2ea1e573d207cc31b7e17c771ab8ca2091b22 out}"}
{"level":"info","ts":1726639328.104377,"logger":"kwild.cometbft","caller":"service/service.go:139","msg":"service start","module":"p2p","peer":"[email protected]:26656","msg":"Starting Peer service","impl":"Peer{MConn{100.29.168.198:26656} 376a6e498c3aa0e1950040f07d2f8430c1824f43 out}"}
{"level":"info","ts":1726639328.1044643,"logger":"kwild.cometbft","caller":"service/service.go:139","msg":"service start","module":"p2p","peer":"[email protected]:26656","msg":"Starting MConnection service","impl":"MConn{100.29.168.198:26656}"}
{"level":"info","ts":1726639328.5130754,"logger":"kwild.cometbft","caller":"service/service.go:139","msg":"service start","module":"p2p","peer":"[email protected]:26656","msg":"Starting Peer service","impl":"Peer{MConn{52.7.138.76:26656} 436a060b5dcc408edf7be221347552ae7524dc65 out}"}
{"level":"info","ts":1726639328.5131476,"logger":"kwild.cometbft","caller":"service/service.go:139","msg":"service start","module":"p2p","peer":"[email protected]:26656","msg":"Starting MConnection service","impl":"MConn{52.7.138.76:26656}"}
{"level":"info","ts":1726639343.6755757,"logger":"kwild.cometbft","caller":"light/client.go:1012","msg":"error from light block request from primary, removing...","module":"statesync","module":"light","error":"post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","height":440953,"primary":"http{http://3.92.83.167:26657}"}
{"level":"error","ts":1726639343.6757576,"logger":"kwild.cometbft","caller":"light/client.go:1115","msg":"error on light block request from witness, removing...","module":"statesync","module":"light","error":"post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","primary":"http{http://3.92.83.167:26657}","goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).findNewPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1115\ngithub.com/cometbft/cometbft/light.(*Client).lightBlockFromPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1014\ngithub.com/cometbft/cometbft/light.(*Client).backwards\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:944\ngithub.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:592\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"error","ts":1726639343.6758423,"logger":"kwild.cometbft","caller":"light/client.go:1123","msg":"failed to remove witnesses","module":"statesync","module":"light","err":"no witnesses connected. please reset light client","witnessesToRemove":[0],"goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).findNewPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1123\ngithub.com/cometbft/cometbft/light.(*Client).lightBlockFromPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1014\ngithub.com/cometbft/cometbft/light.(*Client).backwards\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:944\ngithub.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:592\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"error","ts":1726639343.6758978,"logger":"kwild.cometbft","caller":"light/client.go:604","msg":"Can't verify","module":"statesync","module":"light","err":"failed to obtain the header at height #440953: post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:604\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"info","ts":1726639343.6759436,"logger":"kwild.cometbft","caller":"statesync/syncer.go:260","msg":"failed to fetch and verify app hash","module":"statesync","err":"failed to obtain the header at height #440953: post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded"}
{"level":"info","ts":1726639343.6760156,"logger":"kwild.cometbft","caller":"statesync/syncer.go:206","msg":"Snapshot rejected","module":"statesync","height":440400,"format":0,"hash":"D3DB9111BBB2A3B29F36D66D5DCB2F80C436C3DCE9D874F20C8C0053207ECE55"}
{"level":"info","ts":1726639344.6783712,"logger":"kwild.cometbft","caller":"light/client.go:559","msg":"VerifyHeader","module":"statesync","module":"light","height":439801,"hash":"35E451B06E56DFD5CEC65944F771FA34AE88B7D1AD777F81BFFDA3D2A18D61F2"}
{"level":"info","ts":1726639357.1237133,"logger":"kwild.cometbft","caller":"pex/pex_reactor.go:457","msg":"Ensure peers","module":"pex","numOutPeers":8,"numInPeers":0,"numDialing":0,"numToDial":2}
{"level":"info","ts":1726639357.1237614,"logger":"kwild.cometbft","caller":"pex/file.go:22","msg":"Saving AddrBook to file","module":"p2p","book":"/home/micbun/my-peer-config/abci/addrbook.json","size":20}
{"level":"info","ts":1726639357.1260247,"logger":"kwild.cometbft","caller":"pex/pex_reactor.go:522","msg":"We need more addresses. Sending pexRequest to random peer","module":"pex","peer":"Peer{MConn{100.29.168.198:26656} 376a6e498c3aa0e1950040f07d2f8430c1824f43 out}"}
{"level":"info","ts":1726639373.6765544,"logger":"kwild.cometbft","caller":"light/client.go:1012","msg":"error from light block request from primary, removing...","module":"statesync","module":"light","error":"post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","height":440968,"primary":"http{http://3.92.83.167:26657}"}
{"level":"error","ts":1726639373.6767697,"logger":"kwild.cometbft","caller":"light/client.go:1115","msg":"error on light block request from witness, removing...","module":"statesync","module":"light","error":"post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","primary":"http{http://3.92.83.167:26657}","goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).findNewPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1115\ngithub.com/cometbft/cometbft/light.(*Client).lightBlockFromPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1014\ngithub.com/cometbft/cometbft/light.(*Client).backwards\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:944\ngithub.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:592\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"error","ts":1726639373.676848,"logger":"kwild.cometbft","caller":"light/client.go:1123","msg":"failed to remove witnesses","module":"statesync","module":"light","err":"no witnesses connected. please reset light client","witnessesToRemove":[0],"goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).findNewPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1123\ngithub.com/cometbft/cometbft/light.(*Client).lightBlockFromPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1014\ngithub.com/cometbft/cometbft/light.(*Client).backwards\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:944\ngithub.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:592\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"error","ts":1726639373.6769154,"logger":"kwild.cometbft","caller":"light/client.go:604","msg":"Can't verify","module":"statesync","module":"light","err":"failed to obtain the header at height #440968: post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:604\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"info","ts":1726639373.6769733,"logger":"kwild.cometbft","caller":"statesync/syncer.go:260","msg":"failed to fetch and verify app hash","module":"statesync","err":"failed to obtain the header at height #440968: post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded"}
{"level":"info","ts":1726639373.6770287,"logger":"kwild.cometbft","caller":"statesync/syncer.go:206","msg":"Snapshot rejected","module":"statesync","height":439800,"format":0,"hash":"3D5A816239713E41DC27589E7B3FF8879F0747B6DDBDFEEA99C623E0D0D24552"}
{"level":"info","ts":1726639374.661516,"logger":"kwild.cometbft","caller":"light/client.go:559","msg":"VerifyHeader","module":"statesync","module":"light","height":439201,"hash":"14D9CDD91AD247C8B8DA33ECB38475F9749BE4C0C031952909C0893948536194"}
{"level":"info","ts":1726639387.1224551,"logger":"kwild.cometbft","caller":"pex/pex_reactor.go:457","msg":"Ensure peers","module":"pex","numOutPeers":8,"numInPeers":0,"numDialing":0,"numToDial":2}
{"level":"info","ts":1726639387.1225731,"logger":"kwild.cometbft","caller":"pex/pex_reactor.go:522","msg":"We need more addresses. Sending pexRequest to random peer","module":"pex","peer":"Peer{MConn{3.140.58.76:26656} 34599966ce4b67628f4cfa99fdca74ea2d039018 out}"}
{"level":"info","ts":1726639403.678048,"logger":"kwild.cometbft","caller":"light/client.go:1012","msg":"error from light block request from primary, removing...","module":"statesync","module":"light","error":"post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","height":440964,"primary":"http{http://3.92.83.167:26657}"}
{"level":"error","ts":1726639403.6782522,"logger":"kwild.cometbft","caller":"light/client.go:1115","msg":"error on light block request from witness, removing...","module":"statesync","module":"light","error":"post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","primary":"http{http://3.92.83.167:26657}","goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).findNewPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1115\ngithub.com/cometbft/cometbft/light.(*Client).lightBlockFromPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1014\ngithub.com/cometbft/cometbft/light.(*Client).backwards\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:944\ngithub.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:592\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"error","ts":1726639403.6783211,"logger":"kwild.cometbft","caller":"light/client.go:1123","msg":"failed to remove witnesses","module":"statesync","module":"light","err":"no witnesses connected. please reset light client","witnessesToRemove":[0],"goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).findNewPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1123\ngithub.com/cometbft/cometbft/light.(*Client).lightBlockFromPrimary\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:1014\ngithub.com/cometbft/cometbft/light.(*Client).backwards\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:944\ngithub.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:592\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"error","ts":1726639403.6784005,"logger":"kwild.cometbft","caller":"light/client.go:604","msg":"Can't verify","module":"statesync","module":"light","err":"failed to obtain the header at height #440964: post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded","goversion":"go1.23.1","stacktrace":"github.com/cometbft/cometbft/light.(*Client).verifyLightBlock\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:604\ngithub.com/cometbft/cometbft/light.(*Client).VerifyLightBlockAtHeight\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/light/client.go:493\ngithub.com/cometbft/cometbft/statesync.(*lightClientStateProvider).AppHash\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/stateprovider.go:94\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:258\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}
{"level":"info","ts":1726639403.6784616,"logger":"kwild.cometbft","caller":"statesync/syncer.go:260","msg":"failed to fetch and verify app hash","module":"statesync","err":"failed to obtain the header at height #440964: post failed: Post \"http://3.92.83.167:26657\": context deadline exceeded"}
{"level":"info","ts":1726639403.6785295,"logger":"kwild.cometbft","caller":"statesync/syncer.go:206","msg":"Snapshot rejected","module":"statesync","height":439200,"format":0,"hash":"B2172F747BFFF1DCFCEB6D8CB006935194956EAC03EFFA35C5A65A676120371F"}

brennanjl (Author) commented

@MicBun based on the following error, it seems as though you did not complete step 1. You need to install psql on your machine.

{"level":"error","ts":1726639313.6736848,"logger":"kwild.abci","caller":"abci/abci.go:614","msg":"Failed to apply snapshot chunk: exec: \"psql\": executable file not found in $PATH\nreject snapshot","stacktrace":"github.com/kwilteam/kwil-db/internal/abci.(*AbciApp).ApplySnapshotChunk\n\t/home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/internal/abci/abci.go:614\ngithub.com/cometbft/cometbft/abci/client.(*localClient).ApplySnapshotChunk\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/abci/client/local_client.go:150\ngithub.com/cometbft/cometbft/proxy.(*appConnSnapshot).ApplySnapshotChunk\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/proxy/app_conn.go:220\ngithub.com/cometbft/cometbft/statesync.(*syncer).applyChunks\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:367\ngithub.com/cometbft/cometbft/statesync.(*syncer).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:303\ngithub.com/cometbft/cometbft/statesync.(*syncer).SyncAny\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/syncer.go:185\ngithub.com/cometbft/cometbft/statesync.(*Reactor).Sync\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/statesync/reactor.go:286\ngithub.com/cometbft/cometbft/node.startStateSync.func1\n\t/home/micbun/go/pkg/mod/github.com/cometbft/[email protected]/node/setup.go:519"}

MicBun (Contributor) commented Sep 18, 2024

Hi @brennanjl, I just looked at the log once again after redoing the steps.
Yeah, psql is not in PATH, my bad. Will try it again.

MicBun (Contributor) commented Sep 18, 2024

@brennanjl, I tried with your endpoint and it successfully retrieved the database; however, when we tried it on our server we got a panic:

kwild --root-dir ./my-peer-config/ --chain.statesync.enable=true --chain.statesync.rpc-servers='https://staging.node-1.tsn.truflation.com:26657,https://staging.node-2.tsn.truflation.com:26657'
{"level":"warn","ts":1726667781.1088464,"logger":"kwild","caller":"server/build.go:632","msg":"failed to get latest snapshot from snap provider: post failed: Post \"https://staging.node-1.tsn.truflation.com:26657\": context deadline exceeded"}
{"level":"warn","ts":1726667791.1093252,"logger":"kwild","caller":"server/build.go:632","msg":"failed to get latest snapshot from snap provider: post failed: Post \"https://staging.node-2.tsn.truflation.com:26657\": context deadline exceeded"}
{"level":"info","ts":1726667791.1094348,"logger":"kwild","caller":"server/build.go:329","msg":"closing event store"}
{"level":"info","ts":1726667791.1096592,"logger":"kwild","caller":"server/build.go:329","msg":"closing main DB"}
Error: panic while building kwild: failed to configure state syncer, failed to fetch trust options from the remote server.:

stack:

goroutine 1 [running]:
github.com/kwilteam/kwil-db/cmd/kwild/server.New.func1()
        /home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/cmd/kwild/server/server.go:81 +0x24d
panic({0x1841ac0?, 0xc000262c60?})
        /usr/local/go/src/runtime/panic.go:785 +0x132
github.com/kwilteam/kwil-db/cmd/kwild/server.failBuild(...)
        /home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/cmd/kwild/server/build.go:946
github.com/kwilteam/kwil-db/cmd/kwild/server.buildStatesyncer(0xc0000d04e0)
        /home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/cmd/kwild/server/build.go:669 +0x885
github.com/kwilteam/kwil-db/cmd/kwild/server.buildServer(0xc0000d04e0, 0xc00032a100)
        /home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/cmd/kwild/server/build.go:183 +0xaf
github.com/kwilteam/kwil-db/cmd/kwild/server.New({0x1cb5968, 0xc000470050}, 0xc0004f20f0, 0xc00051f000, 0xc00061a510, 0x0)
        /home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/cmd/kwild/server/server.go:120 +0xa09
github.com/kwilteam/kwil-db/cmd/kwild/root.RootCmd.func1(0xc000492908, {0xc0002c54c0, 0x0, 0x4})
        /home/micbun/go/pkg/mod/github.com/kwilteam/[email protected]/cmd/kwild/root/root.go:122 +0x613
github.com/spf13/cobra.(*Command).execute(0xc000492908, {0xc00014c010, 0x4, 0x4})
        /home/micbun/go/pkg/mod/github.com/spf13/[email protected]/command.go:983 +0xaaa
github.com/spf13/cobra.(*Command).ExecuteC(0xc000492908)
        /home/micbun/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x0?)
        /home/micbun/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039 +0x13
main.main()
        /home/micbun/truflation/tsn/cmd/kwild/main.go:11 +0x18

panic while building kwild: failed to configure state syncer, failed to fetch trust options from the remote server.:

What are these trust options?
P.S. I also tried http (without the s); neither works. Should it matter?

edit: it turns out our 26657 port is not open. I will edit our server option
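For anyone hitting the same thing: CometBFT's RPC exposes a /status endpoint, so reachability can be checked from outside before pointing statesync at an endpoint, e.g.:

curl -s http://staging.node-1.tsn.truflation.com:26657/status   # a JSON response means the port is open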

brennanjl (Author) commented

P.S. I also tried http (without the s); neither works. Should it matter?

Yes. I am not sure whether you have TLS configured on your nodes, but it does matter (e.g., if you have TLS configured and specify http, it will likely fail, and vice versa).

jchappelow (Contributor) commented Sep 18, 2024

Regarding TLS, you won't be able to simply use https in the client URL without the RPC server being configured for it. Unfortunately, it all comes down to where the server gets its TLS key and certificate, and how the client knows to trust it. There are two general solutions, both with some pains:

  • The server uses a self-signed key pair; the certificate is shared with clients over a side channel, and all clients use that certificate in their config. Kwil apps support this, but it's very cumbersome.
  • The server obtains a certificate signed by a root certificate authority (CA) such as LetsEncrypt/Thawte/etc., and periodically has to renew and reload it, but clients require no config because of the root CA trust model.

The second option above is transparent to clients and fairly simple for server operators, with caveats: renewal and loading are typically handled by a cron job (to ensure the certificate doesn't expire) and a reverse proxy like nginx that uses the certificate to terminate TLS before proxying requests to the backend (kwild's RPC server). Managed services can also assist but are generally overkill.

For instance, the Kwil node's RPC server (not the one required for statesync, though) can presently be accessed two ways:

  • http://node0.staging.tsn.kwil.com:8484/ <-- direct to kwild RPC server, no TLS
  • https://node0.staging.tsn.kwil.com/ <-- reverse proxy terminating TLS for kwild on the node's host machine, with certbot renewing the certificate periodically

I would be happy to share some suggested nginx config templates. Also, kwild could consider building in Let's Encrypt certificate requests/renewals, but that is somewhat complex and generally out of scope for node software.
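For the second option, the certbot flow is roughly a sketch like this (assuming nginx already fronts kwild as in the https URL above, and DNS points at the host; the domain is an example):

sudo certbot --nginx -d node0.staging.tsn.kwil.com   # obtain and install the certificate
sudo certbot renew --dry-run                         # verify automatic renewal works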

MicBun (Contributor) commented Sep 19, 2024

Hi @jchappelow, whether related to statesync or not, feel free to share the knowledge 😉

Hi @brennanjl, I got an error message saying there are no snapshots available. Currently, the snapshot directory is located at ./root/.kwild/snapshot and contains:

genesis.json            kwildb-snapshot.sql.gz

Command used:

kwild --root-dir ./my-peer-config/ --chain.statesync.enable=true --chain.statesync.rpc-servers='http://18.189.163.27:26657,http://3.140.58.76:26657'

The error:

{"level":"warn","ts":1726716356.0348566,"logger":"kwild","caller":"server/build.go:632","msg":"failed to get latest snapshot from snap provider: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: no snapshots available"}
{"level":"warn","ts":1726716356.5448039,"logger":"kwild","caller":"server/build.go:632","msg":"failed to get latest snapshot from snap provider: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: no snapshots available"}

Error: panic while building kwild: failed to configure state syncer, failed to fetch trust options from the remote server.:

Another thing I tried: using DNS names resulted in a context deadline error:

kwild --root-dir ./my-peer-config/ --chain.statesync.enable=true --chain.statesync.rpc-servers='http://staging.node-1.tsn.truflation.com:26657,http://staging.node-2.tsn.truflation.com:26657'
kwild --root-dir ./my-peer-config/ --chain.statesync.enable=true --chain.statesync.rpc-servers='https://staging.node-1.tsn.truflation.com:26657,https://staging.node-2.tsn.truflation.com:26657'

Both resulted in context deadline exceeded and a panic:

{"level":"warn","ts":1726716454.3593276,"logger":"kwild","caller":"server/build.go:632","msg":"failed to get latest snapshot from snap provider: post failed: Post \"http://staging.node-1.tsn.truflation.com:26657\": context deadline exceeded"}
{"level":"warn","ts":1726716464.3602548,"logger":"kwild","caller":"server/build.go:632","msg":"failed to get latest snapshot from snap provider: post failed: Post \"http://staging.node-2.tsn.truflation.com:26657\": context deadline exceeded"}

Error: panic while building kwild: failed to configure state syncer, failed to fetch trust options from the remote server.:

Is there anything I missed here?

jchappelow (Contributor) commented

Regarding failed to get latest snapshot from snap provider: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: no snapshots available, the node probably hasn't made its first snapshot yet.

The snapshots.recurring_height setting in config.toml is probably thousands of blocks. I guess snapshots.enabled is true?
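Two quick checks on the producing node (paths assume the default root dir):

grep -n -A 6 '\[app.snapshots\]' ~/.kwild/config.toml   # enabled? recurring_height?
ls -R ~/.kwild/snapshots                                # block-<height> dirs appear once a snapshot is taken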

MicBun (Contributor) commented Sep 19, 2024

The snapshots.recurring_height setting in config.toml is probably thousands of blocks. I guess snapshots.enabled is true?

Yes, it is true. We use the default value for recurring_height.

@jchappelow how can we confirm whether the snapshot has been created? We have ./root/.kwild/snapshot, which contains:

genesis.json            kwildb-snapshot.sql.gz

outerlook (Collaborator) commented

note: the one inside was created manually with kwil-admin as a test (so it's not being created automatically)

outerlook (Collaborator) commented

More things to help with debugging:

Their CometBFT RPC endpoints are available now at 26657, if this helps: http://staging.node-2.tsn.truflation.com:26657/

Node 1 Config Dump (working, but no automatic snapshot)
RootDir = '/root/.kwild'

[app]
admin_listen_addr = '/tmp/kwild.socket'
admin_notls = false
admin_pass = ''
db_read_timeout = 60000000000.0
genesis_state = ''
hostname = 'staging.node-1.tsn.truflation.com'
http_listen_addr = '0.0.0.0:8080'
jsonrpc_listen_addr = '0.0.0.0:8484'
pg_db_host = 'kwil-postgres'
pg_db_name = 'kwild'
pg_db_pass = ''
pg_db_port = '5432'
pg_db_user = 'kwild'
private_key_path = '/root/.kwild/private_key'
profile_file = ''
profile_mode = ''
rpc_req_limit = 4200000.0
rpc_timeout = 45000000000.0
tls_cert_file = '/root/.kwild/rpc.cert'
tls_key_file = '/root/.kwild/rpc.key'

[app.extensions]

[app.snapshots]
enabled = true
max_row_size = 4194304.0
max_snapshots = 3.0
recurring_height = 14400.0
snapshot_dir = '/root/.kwild/snapshots'

[chain]
moniker = 'truflation-node-01'

[chain.consensus]
timeout_commit = 6000000000.0
timeout_precommit = 2000000000.0
timeout_prevote = 2000000000.0
timeout_propose = 3000000000.0

[chain.mempool]
cache_size = 10000.0
max_tx_bytes = 4194304.0
max_txs_bytes = 536870912.0
size = 5000.0

[chain.p2p]
addr_book_strict = false
allow_duplicate_ip = true
dial_timeout = 3000000000.0
external_address = 'http://staging.node-1.tsn.truflation.com:26656'
handshake_timeout = 20000000000.0
listen_addr = 'tcp://0.0.0.0:26656'
max_num_inbound_peers = 40.0
max_num_outbound_peers = 10.0
persistent_peers = 'c6d2ea1e573d207cc31b7e17c771ab8ca2091b22@staging.node-1.tsn.truflation.com:26656,34599966ce4b67628f4cfa99fdca74ea2d039018@staging.node-2.tsn.truflation.com:26656'
pex = true
seed_mode = false
seeds = ''
unconditional_peer_ids = ''

[chain.rpc]
broadcast_tx_timeout = 15000000000.0
listen_addr = 'tcp://0.0.0.0:26657'

[chain.statesync]
TrustHash = ''
TrustHeight = 0.0
TrustPeriod = 36000000000000.0
chunk_request_timeout = 10000000000.0
discovery_time = 15000000000.0
enable = false
rpc_servers = ''
snapshot_dir = 'rcvdSnaps'

[log]
consensus_level = ''
db_level = ''
format = 'json'
level = 'info'
output_paths = ['stdout', 'kwild.log', 'kwild.log']
rpc_level = ''
time_format = 'epochfloat'
Node 2 Config dump (halted, no auto snap)
RootDir = '/root/.kwild'

[app]
admin_listen_addr = '/tmp/kwild.socket'
admin_notls = false
admin_pass = ''
db_read_timeout = 60000000000.0
genesis_state = ''
hostname = 'staging.node-2.tsn.truflation.com'
http_listen_addr = '0.0.0.0:8080'
jsonrpc_listen_addr = '0.0.0.0:8484'
pg_db_host = 'kwil-postgres'
pg_db_name = 'kwild'
pg_db_pass = ''
pg_db_port = '5432'
pg_db_user = 'kwild'
private_key_path = '/root/.kwild/private_key'
profile_file = ''
profile_mode = ''
rpc_req_limit = 4200000.0
rpc_timeout = 45000000000.0
tls_cert_file = '/root/.kwild/rpc.cert'
tls_key_file = '/root/.kwild/rpc.key'

[app.extensions]

[app.snapshots]
enabled = true
max_row_size = 4194304.0
max_snapshots = 3.0
recurring_height = 14400.0
snapshot_dir = '/root/.kwild/snapshots'

[chain]
moniker = 'truflation-node-02'

[chain.consensus]
timeout_commit = 6000000000.0
timeout_precommit = 2000000000.0
timeout_prevote = 2000000000.0
timeout_propose = 3000000000.0

[chain.mempool]
cache_size = 10000.0
max_tx_bytes = 4194304.0
max_txs_bytes = 536870912.0
size = 5000.0

[chain.p2p]
addr_book_strict = false
allow_duplicate_ip = true
dial_timeout = 3000000000.0
external_address = 'http://staging.node-2.tsn.truflation.com:26656'
handshake_timeout = 20000000000.0
listen_addr = 'tcp://0.0.0.0:26656'
max_num_inbound_peers = 40.0
max_num_outbound_peers = 10.0
persistent_peers = 'c6d2ea1e573d207cc31b7e17c771ab8ca2091b22@staging.node-1.tsn.truflation.com:26656,34599966ce4b67628f4cfa99fdca74ea2d039018@staging.node-2.tsn.truflation.com:26656'
pex = true
seed_mode = false
seeds = ''
unconditional_peer_ids = ''

[chain.rpc]
broadcast_tx_timeout = 15000000000.0
listen_addr = 'tcp://0.0.0.0:26657'

[chain.statesync]
TrustHash = ''
TrustHeight = 0.0
TrustPeriod = 36000000000000.0
chunk_request_timeout = 10000000000.0
discovery_time = 15000000000.0
enable = false
rpc_servers = ''
snapshot_dir = 'rcvdSnaps'

[log]
consensus_level = ''
db_level = ''
format = 'json'
level = 'info'
output_paths = ['stdout', 'kwild.log', 'kwild.log']
rpc_level = ''
time_format = 'epochfloat'

outerlook (Collaborator) commented

here are some logs for node 2; they went through some filtering and tail to remove noise:

2024-09-19-node2-tail-filtered.zip

here's the complete filter process

Filters

to filter out this:

{"level":"info","ts":1725903985.1812983,"logger":"kwild.user-jsonrpc-server","caller":"jsonrpc/server.go:526","msg":"request success","method":"user.schema","elapsed":0.000011745}

used "request success","method":"user.schema" in grep -v

filtering:

{"level":"warn","ts":1725903553.4233408,"logger":"kwild.cometbft","caller":"p2p/switch.go:340","msg":"Stopping peer for error","module":"p2p","peer":"Peer{MConn{54.172.111.34:26656} 6718219601786d2511690ced9b020a770361fe76 out}","err":"EOF"}

with: "logger":"kwild.cometbft","caller":"p2p/switch.go:340","msg":"Stopping peer for error"

filtering

{"level":"info","ts":1725903591.2870367,"logger":"kwild.cometbft","caller":"state/execution.go:230","msg":"finalized block","module":"state","height":21741,"num_txs_res":0,"num_val_updates"

with "state/execution.go:230","msg":"finalized block"

filtering

{"level":"info","ts":1725903738.7070634,"logger":"kwild.user-jsonrpc-server","caller":"jsonrpc/server.go:526","msg":"request success","method":"user.call","elapsed":0.006556949}

with "msg":"request success","method":"user.call"

filtering

{"level":"info","ts":1725930896.0628228,"logger":"kwild.user-jsonrpc-server","caller":"jsonrpc/server.go:517","msg":"request success","method":"user.ping","elapsed":0.000004757}

well, I think we're not interested in request success messages now, right?

with "msg":"request success"

filtering

{"level":"info","ts":1726741955.9040473,"logger":"kwild.user-jsonrpc-server","caller":"jsonrpc/server.go:519","msg":"request failure","method":"user.schema","elapsed":0.000007587,"code":-301,"message":"dataset not found"}

with "message":"dataset not found"

then used tail; the first message in this log is from:
GMT: Sunday, September 15, 2024 10:59:19 PM
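Put together, the filtering was roughly a single pipeline like this (the tail length here is illustrative):

grep -v -e '"msg":"request success"' \
        -e '"msg":"Stopping peer for error"' \
        -e '"msg":"finalized block"' \
        -e '"message":"dataset not found"' \
    kwild.log | tail -n 20000 > 2024-09-19-node2-tail-filtered.log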

outerlook (Collaborator) commented Sep 19, 2024

other checks at node 2:

  • is disk full? No:
     Filesystem      Size  Used Avail Use% Mounted on
     devtmpfs        1,9G     0  1,9G   0% /dev
     tmpfs           1,9G     0  1,9G   0% /dev/shm
     tmpfs           1,9G  632K  1,9G   1% /run
     tmpfs           1,9G     0  1,9G   0% /sys/fs/cgroup
     /dev/nvme0n1p1  8,0G  2,4G  5,7G  30% /
     /dev/nvme1n1     50G  7,2G   43G  15% /data         <---- we store it here
     tmpfs           386M     0  386M   0% /run/user/1000
    
  • logs from the same minute as the last known block are available if you grep 17265854 in the logs, so there are some errors there to check

jchappelow (Contributor) commented Sep 19, 2024

Heads up -- if given a relative path for app.snapshots.snapshot_dir, the preview branch writes to a "snapshots" directory in the working directory, wherever the binary was executed from, which is obviously incorrect (it should be under the root dir). Sorry for the confusion. I see you have snapshot_dir = '/root/.kwild/snapshots', so yours should be exactly there, but I wanted to point out that possible pitfall.

I'm still investigating logs to see what might be causing the deadline exceeded errors.

jchappelow (Contributor) commented Sep 19, 2024

The directory where the tsn daemon writes out snapshots for serving to peers looks a bit like this on the Kwil node:

$  tree -L 3 snapshots/
snapshots/
├── block-456600
│   └── format-0
│       ├── chunks
│       └── header.json
├── block-457200
│   └── format-0
│       ├── chunks
│       └── header.json
├── block-457800
│   └── format-0
│       ├── chunks
│       └── header.json
└── block-458400
    └── format-0
        ├── chunks
        └── header.json

13 directories, 4 files

Given this config:

[app.snapshots]

# Enables snapshots
enabled = true

# Path to the snapshots directory
snapshot_dir = "snapshots"

# Specifies the block heights(multiples of recurring_height) at which the snapshot should be taken
recurring_height = 600

# Maximum number of snapshots to store
max_snapshots = 4

I have it set to make snapshots every 600 blocks because 14400 felt too long (about 24 hrs).
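For scale: with timeout_commit = 6 s (per the configs above), 14400 blocks is roughly 14400 × 6 s = 86400 s = 24 hours between snapshots, while 600 blocks is about an hour.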

Also, I got bit by the 0.8.x quirk with snapshot_dir being relative to cwd, and this ended up in ~tsn/snapshots instead of <root_dir>/snapshots. 😊

outerlook (Collaborator) commented

hey @jchappelow, thanks for that! I'll try the exact configs to see if this works.

also, would it only create the snapshot at exact multiples of 14400, or would it try immediately after passing a multiple, once correctly configured?

jchappelow (Contributor) commented

also, would it only create the snapshot at exact multiples of 14400, or would it try immediately after passing a multiple, once correctly configured?

On preview/v0.8 it creates a snapshot at every multiple of 14400. We have since updated main for 0.9 to make a snapshot immediately on startup if it doesn't already have one.

@charithabandi noticed something in your tail filtered logs: "CONSENSUS FAILURE!!!","module":"consensus","err":"failed to apply block; error commit failed for application: failed to commit transaction app: failed to connect to 'host=kwil-postgres user=kwild database=kwild': dial error (timeout: context deadline exceeded)"

It looks like postgres was down/unreachable. This makes me wonder how it's configured. There are two things to check: if using docker, I'd suggest adding --shm-size 256m to the docker run command as described in this docs section, and it may be necessary to add each of "docker", "containerd", and "postgresql" to the Unattended-Upgrade::Package-Blacklist section of /etc/apt/apt.conf.d/50unattended-upgrades (using Ubuntu as an example here). Or even completely disable unattended upgrades, since these typically restart the processes they update: sudo systemctl disable --now apt-daily-upgrade.timer.
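Concretely, the two mitigations look something like this on Ubuntu:

docker run --shm-size 256m ...                        # add to the existing postgres `docker run` arguments
sudo systemctl disable --now apt-daily-upgrade.timer  # stop unattended upgrades from restarting services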


Slightly more concerning issue however:

{"level":"error","ts":1726589057.9866025,"logger":"kwild.cometbft","caller":"blocksync/reactor.go:499","msg":"Error in validation","module":"blocksync","err":"wrong Block.Header.AppHash. Expected FD2E599AE419DBEB2A226D95B1CD157C78036EC92720B4F182A31BA5386C0F48, got A3BBE42F8D1DA1871E1B5E2A6C49E7C2216851EFA5790E48DC6AED0151F6758D","goversion":"go1.22.1","stacktrace":"github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine\n\t/go/pkg/mod/github.com/cometbft/[email protected]/blocksync/reactor.go:499\ngithub.com/cometbft/cometbft/blocksync.(*Reactor).OnStart.func1\n\t/go/pkg/mod/github.com/cometbft/[email protected]/blocksync/reactor.go:132"}

Is it possible that this node may have synchronized prior to kwilteam/kwil-db@ad379ae?
Some bad snapshots were created and disseminated prior to that fix, unfortunately. If that's the case, a resync may be in order for each node getting the error.
Let's make extra sure that we're running with that latest change on both snapshot producers and consumers (statesyncing nodes).

On a related note, I'm really looking forward to migrating when 0.9 is ready. 😅

brennanjl (Author) commented Sep 19, 2024

@outerlook this is the exact same issue (same apphash mismatch) as noted here: trufnetwork/node#609 (comment)

You need to re-sync your node, because for some reason it has bad state. There's a very good chance that it synced bad state previously via statesync. On Friday, in Slack, you said that you synced your nodes using this PR. However, that PR had an issue with statesync (it wouldn't sync all state); this is why we had to add this PR for the v1.1.9 release. But if you didn't wipe your node's state in between using these two commits, then you would in fact be at an invalid state (which explains why the node fails immediately).

Following the commit history of TSN:

# TSN Commit History
...
b1822e5 <- the commit that fixed statesync, and where 1.1.9 was tagged
...
ffa5f44 <- the commit that tried to add statesync, but added a bug. You tested this on your nodes according to your Slack message
8eaae20 <- the commit prior to statesync
...

If you still had the state you synced using the ffa5f44 commit, then it would explain why you are in a bad state and failing to sync.
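One way to confirm which side of the fix a checkout is on (a sketch, using the hashes from the history above):

# In your tsn checkout: is the statesync fix an ancestor of HEAD?
git merge-base --is-ancestor b1822e5 HEAD \
  && echo "includes the statesync fix" \
  || echo "predates the fix -- wipe state and resync"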

outerlook (Collaborator) commented

Thank you! That explains it. I'll wipe the data on that node and restart it via statesync. I'll also set a lower value for the snapshot recurring height.
