Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server resets its latest block from time to time #400

Closed
hhanh00 opened this issue Jul 7, 2022 · 13 comments
Closed

Server resets its latest block from time to time #400

hhanh00 opened this issue Jul 7, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@hhanh00
Copy link

hhanh00 commented Jul 7, 2022

What is the bug?

Occasionally, but not rarely, the server reports a very low height.
For example:

$ grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "622062",
  "hash": "vhlJK+qM9RkcHwfds7TYnTuoLtCwDaBOGvp6AQAAAAA="
}

Additional context
The height returned by the server info call is accurate though. It looks like lightwalletd is performing a deep reorg.
After a while, it gets back in sync.

@hhanh00 hhanh00 added the bug Something isn't working label Jul 7, 2022
@skyl
Copy link

skyl commented Jul 9, 2022

I launched a server using master branch at db2795a
and I have yet to experience this. mainnet.lightwalletd.com:9067 seems to have been working for about ~24 hours now as well. I'm watching to see if this happens again .. I did observe it a couple of days ago on mainnet.lightwalletd.

» grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "1732043",
  "hash": "RgDuhFy+411fXeOkNlmZtSxZtdt07KJU9u1HAQAAAAA="
}
» grpcurl zuul.free2z.cash:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "1732043",
  "hash": "RgDuhFy+411fXeOkNlmZtSxZtdt07KJU9u1HAQAAAAA="
}

@skyl
Copy link

skyl commented Jul 25, 2022

mainnet.lightwalletd.com is currently reset to a low height again

$ grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock
{
  "height": "488166",
  "hash": "VTzEONp5y4D7QBzt8RNgr3rVw8xMlaaRYHYvAAAAAAA="
}
$ grpcurl zuul.free2z.cash:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock

{
  "height": "1749543",
  "hash": "j6/Z2kpvQFLMe0F8gl1MJyXFX3HvEj0X6c60AQAAAAA="
}

@LarryRuane
Copy link
Collaborator

Please post here the result of GetLightdInfo when you're seeing this problem.

I haven't seen this myself, and I can't think of what could cause this problem. Can you provide more detail? Let's say you were running GetLatestBlock from a loop every few seconds to get a stream of samples. Would it report the correct height many times in a row, and then on the very next sample, report a very low height, and then on the next sample, back to the correct height? Something like that?

Or, after reporting the very low height, would it slowly return to the correct height (like maybe advancing several hundred blocks per second)?

In other words, I'd like to understand the timing behavior. That would help narrow down the source of the problem. Thanks!

@hhanh00
Copy link
Author

hhanh00 commented Jul 26, 2022

Sorry, I forgot to post GetLightInfod results though I tried it. The server height reported is the correct one, i.e., Getlastblock resets but not GetLightInfod.

Getlastblock indeed slowly goes back up as if block ingestion is repeated. Eventually, the server is back on track. However the process takes a few hours.

It is hard to tell when it happens and it seems to fix itself after a while.

However, wallet users are reporting lost funds (they are not lost) because wallets think they are synced but actually haven't processed the later blocks.

@LarryRuane
Copy link
Collaborator

The server height reported is the correct one, i.e., Getlastblock resets but not GetLightInfod

Okay, but I was asking for the GetLightdInfo output to see which version was running. Oh but I just ran it myself (should have thought of that!), and I see it's now at v0.4.13, which is the latest version.

My guess is that this is some sort of infrastructure problem, rather than in the lightwalletd code itself. We (ECC) don't run the mainnet.lightwalletd.com instance, so it's unclear what they're doing. The lightwalletd server creates a (compact) block cache in the form of two disk files, here is the default location (for mainnet):

$ ls -l /var/lib/lightwalletd/db/main
total 2358072
-rw-r--r-- 1 larry larry 2409336991 Jul 22 12:12 blocks
-rw-r--r-- 1 larry larry    5310024 Jul 22 12:12 lengths
$ 

If something outside of lightwalletd deletes or corrupts these files, the result would be what you're seeing. There's no way for us to know if that's actually what's happening, I'm just saying it's something like this. Maybe the file system holding these files runs out of space? Although I think lightwalletd would just halt; I don't know why it would back up many blocks.

Who runs that instance, isn't it Nighthawk wallet? Do they have a GitHub repo that you can open a ticket on for this?

@hhanh00
Copy link
Author

hhanh00 commented Jul 28, 2022

I contacted him on telegram and he assured me that the ECC was working actively on it as a top priority. I thought that you work at the ECC?

@LarryRuane
Copy link
Collaborator

I contacted him on telegram and he assured me that the ECC was working actively on it as a top priority. I thought that you work at the ECC?

Yes, I do, but I'm not aware of any bug that makes the block height back up like that. There have been a few fixes lately, maybe you can try it again and make sure it's still happening (if you don't mind).

@hhanh00
Copy link
Author

hhanh00 commented Jul 28, 2022

From a client's point of view, it happens randomly. But when it does, there is no clear indication that something is wrong. I could check both GetLastBlock and GetLightdInfo to see if they match but it seems to be a kludge since it is not supposed to happen anyway.

Btw, a user reported an issue 2 days ago, so I think it's still happening.

@LarryRuane
Copy link
Collaborator

You definitely should not need to call both, I'll dig into this and try to figure out what's going on.

@LarryRuane
Copy link
Collaborator

I just noticed that mainnet.lightwalletd.com seems to be down. I was going to say that I haven't had any luck seeing this problem. For the last several days, I've been continuously running this command:

while sleep 10;do grpcurl mainnet.lightwalletd.com:9067 cash.z.wallet.sdk.rpc.CompactTxStreamer/GetLatestBlock ;done 2>/dev/null | jq '.height|tonumber|select(. < 1753466)'

This checks every 10 seconds to see if the reported block height drops below 1753466 (just what the tip height happened to be when I started the script), and it hasn't printed anything (meaning no block was reported below that height). But that server has been down, since I don't know when. I just started the script to my local instance, just to see if I can reproduce it that way.

@LarryRuane
Copy link
Collaborator

LarryRuane commented Aug 1, 2022

Update, it's been running for about 16 hours, and no block height resets have occurred.

There's been some speculation that this could have been caused by #408 (which should be fixed by #409, awaiting review), the idea being that if the lightwalletd doesn't cleanly shut down, the blocks cache could be in a strange state that required resynchronizing from a long way back in the chain. This is superficially plausible. But I don't think it's right. I just killed my lightwalletd without giving it a chance to cleanly shut down, and when I restarted it, it didn't reset (resync many blocks), it picked up exactly where it was when I killed it. (And, knowing how the code works, this is what I would expect.)

@LarryRuane
Copy link
Collaborator

Is this problem still being seen? I haven't made any progress in reproducing it.

@hhanh00
Copy link
Author

hhanh00 commented Aug 11, 2022 via email

@hhanh00 hhanh00 closed this as completed Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants