Bug: New Nodes Cannot Sync #23
Comments
Alright, will replace the
Hi guys, can you provide binary builds for this release?
Hi @azgms, you can get the binary from our release page here:
Thanks @MicBun!
Hi @brennanjl, I tried to use a snapshot from the Kwil endpoints via the CLI command you provided; however, the snapshot was rejected with
@MicBun, based on the following error, it seems as though you did not complete step 1. You need to install
Hi @brennanjl, just looking at the log once again after doing the steps again.
@brennanjl, I tried with your endpoint and it successfully retrieved the database; however, when we tried it on our server we got a panic
What are these trust options? Edit: turns out our
Yes. I am not sure whether you have TLS configured on your nodes, but it does matter (e.g. if you have TLS configured and use http in the URL, it will likely fail, and vice versa).
Regarding TLS, you won't be able to simply assert https in the client URL without the RPC server being configured for that. Unfortunately it all comes down to where the server gets its TLS key and certificate, and how the client knows to trust it. There are two general solutions for this, both with some pains:
The second option above is transparent to clients and fairly simple for server operators, with caveats: the pain of renewal and loading is typically handled by a cron job to ensure the certificate is not expired, and a reverse proxy like nginx uses the certificate to terminate TLS before proxying the request to the backend (kwild's RPC server). Managed services can also assist but are generally overkill. For instance, the Kwil node RPC server (not the one required for statesync, though) can presently be accessed in two ways:
I would be happy to share some suggested nginx config templates. Also, kwild could consider building in letsencrypt certificate requests/renewal, but it is somewhat complex and generally out of scope for node software.
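A rough sketch of that kind of nginx setup, with placeholder domain, certificate paths, and backend port: it assumes a certbot-issued certificate, the Debian-style sites-available layout, and that kwild's RPC server listens on localhost:8484, so adjust everything to your own config.

```bash
# Sketch only: rpc.example.com, the certificate paths, and the backend port 8484
# are placeholders -- substitute your own domain and kwild RPC listen address.
sudo tee /etc/nginx/sites-available/kwild-rpc >/dev/null <<'EOF'
server {
    listen 443 ssl;
    server_name rpc.example.com;

    ssl_certificate     /etc/letsencrypt/live/rpc.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/rpc.example.com/privkey.pem;

    location / {
        # Terminate TLS here and forward plain HTTP to the kwild RPC server.
        proxy_pass http://127.0.0.1:8484;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
EOF
sudo ln -sf /etc/nginx/sites-available/kwild-rpc /etc/nginx/sites-enabled/kwild-rpc
sudo nginx -t && sudo nginx -s reload

# Renewal is typically left to certbot's systemd timer or a cron entry, e.g.:
# 0 3 * * * certbot renew --quiet --deploy-hook "nginx -s reload"
```

With a setup like this, clients can simply use https://rpc.example.com without any extra trust configuration, since the certificate chains to a public CA.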
Hi @jchappelow, whether it is related to the state sync or not, feel free to share the knowledge 😉 Hi @brennanjl, I got an error message saying there is no snapshot available. Currently, the snapshot directory is located in:
Command used:
The error:
Another thing I tried: using DNS resulted in a context deadline error:
Both resulted in a context deadline exceeded and a panic error:
Is there anything I am missing here?
Regarding the
Yes, it is true; we use the default value. @jchappelow, how can we confirm whether or not the snapshot has been created? We have:
Note: the one inside was created manually with
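One simple way to check is to watch the snapshot directory itself; a sketch, with the path as a placeholder for whatever snapshot directory your config actually uses:

```bash
# Placeholder path: substitute the snapshot directory from your own config.
SNAP_DIR="$HOME/.kwild/snapshots"

# If automatic snapshots are being produced, new entries should appear here each
# time the chain height passes a multiple of the configured recurring height.
ls -lR "$SNAP_DIR"
```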
More things to help with debugging: their CometBFT RPC endpoints are available now at 26657, if this helps: http://staging.node-2.tsn.truflation.com:26657/

Node 1 config dump (working, but no automatic snapshot):
Node 2 config dump (halted, no automatic snapshot):
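As a quick way to compare the two nodes through the CometBFT RPC endpoint shared above, a couple of standard read-only queries (assuming jq is installed):

```bash
# Latest block height and whether the node is still catching up.
curl -s http://staging.node-2.tsn.truflation.com:26657/status \
  | jq '.result.sync_info | {latest_block_height, catching_up}'

# Number of connected peers.
curl -s http://staging.node-2.tsn.truflation.com:26657/net_info \
  | jq '.result.n_peers'
```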
Here are some logs for node 2, but they went through some filtering and tail to remove some noise: 2024-09-19-node2-tail-filtered.zip

Here's the complete filter process. The filters were applied in several successive passes (well, I think we're not interested in request success messages now, right?), and then tail was used; the first message from the log here is from:
Other checks at node 2:
Heads up: if given a relative path to
I'm still investigating the logs to see what might be causing the deadline exceeded.
The directory where the tsn daemon writes out snapshots for serving to peers looks a bit like this on the Kwil node:
Given this config:
I have it set to make snapshots every 600 blocks because 14400 felt too long (about 24 hrs). Also, I got bit by the 0.8.x quirk with
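For reference, the snapshot settings being discussed live in kwild's config.toml; a sketch along these lines reflects the 600-block interval (the section and key names here are assumptions, so confirm them against the docs for your release):

```toml
# Sketch only: section/key names are assumptions; check the kwild docs for your version.
[app.snapshots]
enabled = true            # produce snapshots that peers can use for statesync
recurring_height = 600    # snapshot every 600 blocks instead of 14400 (~24 h)
max_snapshots = 3         # number of historical snapshots to retain
```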
Hey @jchappelow, thanks for that! I'll try the exact configs to see if this works. Also, would it only create the snapshot once at 14400 multiples, or would it try immediately after passing some multiple, if it was just configured correctly?
On preview/v0.8 it creates a snapshot at every 14400 multiple. We have since updated on main for 0.9 to make a snapshot immediately on startup if it doesn't already have one created. @charithabandi noticed something in your tail-filtered logs: it looks like postgres was down/unreachable. This makes me wonder how it's configured. There are two things to check; if using docker, I'd suggest adding
A slightly more concerning issue, however:
Is it possible that this node may have synchronized prior to kwilteam/kwil-db@ad379ae? On a related note, I'm really looking forward to migrating when 0.9 is ready. 😅
@outerlook this is the exact same issue (same apphash mismatch) as noted here: trufnetwork/node#609 (comment). You need to re-sync your node, because for some reason it has bad state. There's a very good chance that it synced bad state previously via statesync. On Friday, in Slack, you said that you synced your nodes using this PR. However, this PR had an issue with statesync (it wouldn't sync all state); this is why we had to add this PR for the v1.1.9 release. But if you didn't wipe your node's state in between using these two commits, then you would in fact be at an invalid state (which explains why the node fails immediately). Following the commit history of TSN:
If you still had the state you synced using the ffa5f44 commit, then it would explain why you are in a bad state and failing to sync.
Thank you! That explains it. I'll wipe the data on that node and restart it via statesync. I'll also set a lower value for the snapshot recurring height.
The bug that caused the halt a couple of days ago (mentioned here #17) appears to not be fully resolved. While node operators that were online at the time of the halt have been able to proceed as usual, new nodes that are syncing the network halt at the block at which the error occurred. This has been noted by two separate joiners (see here #20) and has been reproduced by the Kwil team.
The Problem
The bug that is causing new nodes to fail to sync, while related to the previous halt, is a different bug. The previous halt seemed to cause CometBFT (the consensus engine that Kwil uses) to end up in an unexpected state while processing the byzantine validator punishment. The state is stored and managed by CometBFT, and thus isn't directly modifiable even by the Kwil team. We are working with the CometBFT team to further diagnose this issue and create a fix. In the meantime, we have created a fix that will allow node operators to join the network.
Immediate Fix: State Sync
The quickest and easiest way to fix this issue is to use statesync to allow new nodes to sync from snapshots. This is a band-aid fix, but it should allow onboarding operators to continue as normal.
In order for this to happen, new nodes will have to sync from nodes that are creating state snapshots. Kwil's node is already doing this. Since most new nodes use TSN's nodes as the bootstrap nodes, we would also recommend that the TSN team have their validators do this (CC: @outerlook @MicBun). It is not necessary for other validators to configure this, but they are free to. To run `kwild` in snapshot mode, please refer to the docs. It shouldn't take more than a couple of minutes to configure.

New nodes joining the network will also need to run the new `v1.1.9` release.
New node operators should configure snapshots on their nodes to connect properly. All new nodes should now use the following setup when running for the first time:

Step 1: Ensure `psql` is installed

Nodes should have the `psql` command line utility installed.
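A quick check (the package name assumes Debian/Ubuntu; other distributions differ):

```bash
# Verify the psql client is available.
psql --version

# If it is missing (Debian/Ubuntu package name assumed):
sudo apt-get update && sudo apt-get install -y postgresql-client
```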
Step 2: Wipe All Application State

If you were previously running a node that failed to sync, make sure you delete all of your data. This means deleting all of your data from Postgres, as well as from `~/.kwild/abci` and `rcvdSnaps` (some of these might not exist depending on your prior configuration, in which case that is ok).
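A rough sketch of the wipe for a default layout; the paths are assumptions (in particular, `rcvdSnaps` is assumed to live under the root directory), so adapt them to your own `--root-dir` and Postgres setup:

```bash
# Assumes the default ~/.kwild root directory; adjust if you use --root-dir.
rm -rf ~/.kwild/abci ~/.kwild/rcvdSnaps

# Postgres data must be wiped as well. How depends on your setup, for example by
# dropping the databases kwild created, or by recreating the Postgres
# container/volume dedicated to kwild if you run Postgres in Docker.
```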
Step 3: Run the node with the statesync config flags
```bash
kwild --root-dir ./my-peer-config/ --chain.statesync.enable=true --chain.statesync.rpc-servers='http://3.92.83.167:26657'
```
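If you prefer persisting this in config.toml rather than passing flags, the equivalent would look roughly like the following; the key names are assumptions that mirror the flags above, and additional statesync settings (such as the trust options mentioned earlier in the thread) may also apply:

```toml
# Sketch mirroring the flags above; key names are assumptions, confirm against the docs.
[chain.statesync]
enable = true
rpc_servers = "http://3.92.83.167:26657"
```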
@outerlook @MicBun we may want to update TSN's docs for this. The above uses Kwil's node for syncing snapshots, but once you have TSN's nodes running this, you can replace it with your own endpoints if you want.
Our team will follow up with a plan for a longer term fix to re-enable blocksync once we have solved the root CometBFT issue.
EDIT: I added that you need to wipe all of your local state and have `psql` installed. This was a step that I overlooked, and many ran into it within Discord.