postgres backend - LND v0.14.1-beta "lnd compatibility check failed" #78

Open
miketwenty1 opened this issue Dec 22, 2021 · 8 comments

@miketwenty1

Need help understanding what's going on with my setup or if this is a bug.

Note: I'm currently running lndmon for many nodes that use the standard bbolt/boltdb backend.
For some reason I'm getting errors when using LND with postgres.

logs:

2021-12-22 02:39:55.978 [INF] LNDMON: Starting Prometheus exporter...
2021-12-22 02:39:55.978 [INF] HTLC: Starting Htlc Monitor
2021-12-22 02:39:55.979 [INF] LNDMON: Prometheus active!
Lndmon exiting with error: GraphCollector DescribeGraph failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2021-12-22 02:40:35.757 [INF] HTLC: Stopping Htlc Monitor
2021/12/22 02:40:35 Stopping Prometheus Exporter
GraphCollector DescribeGraph failed with: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Sometimes the only error in the logs is this:

lnd compatibility check failed: unable to get info for lnd node: rpc error: code = DeadlineExceeded desc = context deadline exceeded
@guggero
Member

guggero commented Dec 22, 2021

Sounds like the request is just timing out. lndmon uses the default RPC timeout of 30 seconds. Does it take longer than 30 seconds to call lncli getinfo on the postgres lnd?

@miketwenty1
Author

@guggero the response is nearly instant when I do a lncli getinfo. Let me know what else I should test.

@guggero
Member

guggero commented Dec 22, 2021

Ah, I looked at the wrong error message. It seems DescribeGraph is what fails, not GetInfo. Can you check whether the error goes away when you add --caches.rpc-graph-cache-duration=5m?
You might need to fill the cache initially with lncli describegraph, then the lndmon calls should be answered almost immediately.

@miketwenty1
Author

miketwenty1 commented Dec 22, 2021

You're recommending I run lncli describegraph to cache for 5m instead of default of 1m on bootup of LND?

I ran LND with this config and then ran lncli describegraph. If I start lndmon right afterwards, it comes up as a healthy Prometheus target, but after a while it crashes with the same error.

Something to note in terms of latency:

  • It took 2 minutes and 39 seconds for the node to respond to my lncli stop command when I brought it down for the cache update.
  • It took 1 minute and 50 seconds to run the lncli describegraph command after I booted with the new cache config.

Not sure if this would warrant a ticket in the lightningnetwork/lnd repo?

@guggero
Member

guggero commented Dec 23, 2021

This is the same issue as lightningnetwork/lnd#6107 then. The in-memory graph is exactly the same data as is served in describegraph. If it takes multiple minutes to load it on startup then it will take multiple minutes to scrape from the RPC, unless the RPC graph cache is turned on. But every time the graph cache expires, the first scrape will take that long again.

I see two ways to work around this (the main fix will be to speed up the graph download in postgres):

  • Set the rpc-graph-cache-duration to an effectively infinite time (e.g. 8760h, which is one year) to disable updating the graph data in lndmon.
  • Or increase the default RPC timeout (it must be added to this struct: https://github.com/lightninglabs/lndmon/blob/master/lndmon.go#L41) and the scrape interval to something larger than the 1 minute 50 seconds it takes to load the graph.

@miketwenty1
Author

Why is this only happening with postgres backend?

@guggero
Member

guggero commented Jan 3, 2022

Why is this only happening with postgres backend?

Not sure what you mean... context deadline exceeded is Go's way of saying "something timed out". So the error occurs because the DescribeGraph call takes too long with postgres.

@sandipndev

It looks like this happens on postgres and not on bbolt; I can reproduce it. getinfo took 2m4s to respond.
