Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance of Filecoin.StateGetBeaconEntry for old beacons #12527

Closed
6 of 11 tasks
LesnyRumcajs opened this issue Sep 30, 2024 · 11 comments
Closed
6 of 11 tasks

Poor performance of Filecoin.StateGetBeaconEntry for old beacons #12527

LesnyRumcajs opened this issue Sep 30, 2024 · 11 comments
Labels
kind/bug Kind: Bug

Comments

@LesnyRumcajs
Copy link
Contributor

LesnyRumcajs commented Sep 30, 2024

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

❯ ./lotus version
Daemon:  1.29.2-rc1+calibnet+git.4004ca691+api1.5.0
Local: lotus version 1.29.2-rc1+calibnet+git.4004ca691

Repro Steps

curl --silent -X POST -H "Content-Type: application/json" \
             --data '{"jsonrpc":"2.0","id":2,"method":"Filecoin.StateGetBeaconEntry","params": [ 1 ] }' \
             "http://127.0.0.1:1234/rpc/v1"

First run takes well over a minute (91s on my machine)

Describe the Bug

It is a regression from old performance (pre #12414 and pre change that lead to creation of this issue), where it took significantly less time to get that beacon information (the results are identical).

Logging Information

Lotus v1.29.2-rc1

❯ time curl --silent -X POST -H "Content-Type: application/json" \
             --data '{"jsonrpc":"2.0","id":2,"method":"Filecoin.StateGetBeaconEntry","params": [ 1 ] }' \
             "http://127.0.0.1:1234/rpc/v1"
{"id":2,"jsonrpc":"2.0","result":{"Round":2396511,"Data":"okAr/DU2fJ9WFpyqunhK3kkI6Z6dG5uZqBk+kfcEYXSTYIbHLL/y5ka0ZRopqAdEGAfODlP4wUaYnWh0c1COMWBtkfFGXBWGc15oU8LTDs2oZB9Sr/DNkfTQLJD9BtcA"}}

________________________________________________________
Executed in  184.56 secs      fish           external
   usr time    7.72 millis    0.00 micros    7.72 millis
   sys time    4.42 millis  566.00 micros    3.86 millis

Lotus v1.28.0 in comparison:

❯ time curl --silent -X POST -H "Content-Type: application/json" \
             --data '{"jsonrpc":"2.0","id":2,"method":"Filecoin.StateGetBeaconEntry","params": [ 1 ] }' \
             "http://127.0.0.1:1234/rpc/v1"
{"id":2,"jsonrpc":"2.0","result":{"Round":2396511,"Data":"okAr/DU2fJ9WFpyqunhK3kkI6Z6dG5uZqBk+kfcEYXSTYIbHLL/y5ka0ZRopqAdEGAfODlP4wUaYnWh0c1COMWBtkfFGXBWGc15oU8LTDs2oZB9Sr/DNkfTQLJD9BtcA"}}

________________________________________________________
Executed in  277.03 millis    fish           external
   usr time    6.21 millis    0.00 micros    6.21 millis
   sys time    4.93 millis  771.00 micros    4.16 millis


@rvagg
Copy link
Member

rvagg commented Oct 1, 2024

OK, wow, that's worth a look, but I suppose it's to do with tipset lookup

@rvagg
Copy link
Member

rvagg commented Oct 1, 2024

I can't replicate this unfortunately @LesnyRumcajs, first run or otherwise:

$ lotus -version
lotus version 1.29.2-rc1+mainnet+git.4004ca691
$ uname -a
Linux fletcher 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":[1],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
81086
Took: 0:00.09s

In fact, I can run it with any random epoch and it's all pretty quick:

$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
2679701
Took: 0:00.04s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
2498234
Took: 0:00.04s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
1920535
Took: 0:00.05s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
3723317
Took: 0:00.02s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
3639175
Took: 0:00.02s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
2753507
Took: 0:00.04s
$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":['$(shuf -i 1-4000000 -n 1)'],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
851290
Took: 0:00.08s

I had to upgrade to rc1 to get this going, so it's fairly soon after running. And I'm running splitstore on this so only have a ~170G blockstore. But it looks like you're running calibnet so I imagine it's not huge, or is it a non-splitstore calibnet node perhaps? Or is this a particularly slow machine? What might be the difference here?

@rvagg
Copy link
Member

rvagg commented Oct 1, 2024

ahh, restart my node and call it as soon as it's live:

$ /usr/bin/time -f 'Took: %Es' curl -s -X POST -H "Content-Type: application/json" --data '{"method":"Filecoin.StateGetBeaconEntry","params":[1],"id":2,"jsonrpc":"2.0"}' http://127.0.0.1:1234/rpc/v1 | jq .result.Round
81086
Took: 4:05.58s

I must have waited too long after restart the first time and something had the chance to prime the tipset cache.

@LesnyRumcajs
Copy link
Contributor Author

Yeah, it's the issue after startup. For context, the setup in Forest's parity tests is:

  1. Wait for Lotus and Forest to sync (in parallel)
  2. Call methods and compare results.

Given that, Lotus started to timeout (over 120s on a regular GHA runner) for this method since v1.29.2-rc1, which was fine in the past (excluding some minor versions in-between where the method was broken).

@rvagg
Copy link
Member

rvagg commented Oct 2, 2024

The conclusion from a brief investigation and internal chat about this is that it comes down to hydrating the tipset cache, which is only done on demand. So if your call to fetch tipset 1 is early enough then you're going to trigger that process and it involves a full walk-back from the known-good head and a caching of a skip-list in the process. Once that's done it's fairly speedy but of course you have the problem of wanting to do it pretty early.
I'm not sure what we can do about this in the short term, but given that mainnet is now 4.3M tipsets long, this is one of those growth problems that probably should be looked at soon. Fortunately though we don't have a lot of users reaching that far back to early tipsets after the start of their node. Your case is pretty unique.

A possible solution might be to lump this into the ChainIndexer work, where we're rolling up the 3 (optional) sqlite indexes of chain data into a single entity, which can be extended to index other parts of the chain, or state. Having to re-parse the entire chain each restart isn't ideal, but having a trusted index that could be used immediately and checked occasionally, might be a good approach here.

For now, could you cheat a bit on this one? If you're doing a whole bunch of compatibility checks, could you send a request for the epoch 1 beacon at the beginning of your checks and let it run in the background (ignoring the result but letting it process) while you do the rest of the checks and then do the beacon checks right at the end? So you get lotus starting on that tipset walk process in parallel to the other tests you're running?

@LesnyRumcajs
Copy link
Contributor Author

LesnyRumcajs commented Oct 2, 2024

I mean, it's fine, our current workaround is to just check a more recent epoch.

Could you please help me understand why the current solution takes so much time, whereas the previous one (v1.28.0) didn't? I'm curious, because a similar change might need to be implemented in Forest as well (or not), which does this, currently, relatively fast (code for reference)

@rvagg
Copy link
Member

rvagg commented Oct 2, 2024

I think you're doing a call out to drand here, right? https://github.com/ChainSafe/forest/blob/5fd1ec506ee7db4d83fa8aca14338429bf11938c/src/rpc/methods/state.rs#L1755

Previously Lotus did this, but the client configs were removed for mainnet, so we only have client configs for quicknet.

There's a couple of issues with this:

It doesn't scale, because it requires drand to be maintaining those networks in perpetuity, which isn't the case. Prior to Filecoin mainnet launch (proper), between epochs 0 and 51000, we used the "incentinet" drand network, which has apparently been down for years now, so requesting those beacons isn't going to work anyway (and the incentinet client configs were removed long ago from Lotus anyway, so asking for those epochs would have returned an error like the one you experienced that raised this issue in the first place!). You don't notice this in your testing because CalibrationNet is newer than incentinet so only uses drand mainnet and now quicknet. Eventually drand mainnet is going to go away too, so those beacons will stop responding too.

The other problem is that we have a mismatch on some epochs between the drand epoch and the filecoin epoch. It only got super strict by NV15, and during NV14 it was off-by-one. So we've had to decide what StateGetBeaconEntry is supposed to do - get the precise drand beacon entry for that filecoin epoch? Or get that drand beacon entry that would have been used for that epoch because it got baked into the blocks? We've chosen to go with that approach, because it also matches what would be used if you were to use get_beacon_randomness from a smart contract. The exact historical behaviour was summarised in filecoin-project/FIPs#1054 after I ended up having to work through it all.

So, if you want to match Lotus precisely on this API you're going to have to fish around in tipsets to find beacon entries using that same algorithm for historic look-ups. Then if you want to be even more comprehensive in testing compatibility you could look at null and non-null rounds within each of the main periods of beacon history: pre-nv13, nv13, nv14, post-nv14 and then post-quicknet which was 120 epochs after nv23. Probably unnecessary, however, given FIP-0095, it's now possible for smart contracts to reach back in history for these beacons, so it'll end up being a consensus issue if Forest is providing different entries to Lotus and someone writes a contract that goes back into these periods. (Also depends on how you're exposing the beacon lookups to that syscall, maybe that's different to how you're serving StateGetBeaconEntry?).

@LesnyRumcajs
Copy link
Contributor Author

@rvagg Thanks a lot for the comprehensive explanation. We are indeed using a different approach for StateGetBeaconEntry and what we provide for FVM and, likely, we should match the behaviour (at the cost of perf, but well, better slow and correct than fast and wrong).

@rvagg
Copy link
Member

rvagg commented Oct 3, 2024

I'd be interested to know if you come up with an efficient and dependable way to establish the epoch->tipset mapping, both on startup and continuously.

@rjan90
Copy link
Contributor

rjan90 commented Oct 22, 2024

@rvagg It seems like you have addressed the reason why the performance of Filecoin.StateGetBeaconEntry might be slow. Do we still want this ticket to be held open, or should we open another ticket for exploring the epoch->tipset mapping and link to this for context?

@rvagg
Copy link
Member

rvagg commented Oct 23, 2024

This ticket is going to be the best to track the resolution for the concern here: #12568

@rvagg rvagg closed this as completed Oct 23, 2024
@github-project-automation github-project-automation bot moved this from 📌 Triage to 🎉 Done in FilOz Oct 23, 2024
@rjan90 rjan90 moved this from 🎉 Done to ☑️ Done (Archive) in FilOz Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Kind: Bug
Projects
Status: ☑️ Done (Archive)
Development

No branches or pull requests

3 participants