Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optizmize base migrations #285

Open
rjan90 opened this issue Jul 3, 2024 · 16 comments
Open

Optizmize base migrations #285

rjan90 opened this issue Jul 3, 2024 · 16 comments
Assignees

Comments

@rjan90
Copy link
Contributor

rjan90 commented Jul 3, 2024

When benchmarking the nv23 migration on my mainnet node, I get these bench-times:

Migration without cache is taking 41seconds:

migration height  4056425
old cid  bafy2bzaceafswlgag4nowendbxzwxvvt4ofn4vyp75kbmdods4x2vbsepgt5q
new cid  bafy2bzacecparpmzboywlcpuqz5vp2sn47lk4d3bc24oksvynlbfs4v7vngwy
completed round actual (without cache), took  41.730056334s
manifest cid: bafy2bzacecbueuzsropvqawsri27owo7isa5gp2qtluhrfsto2qg7wpgxnkba

Migration with cache is taking 36 seconds which is higher then what our threshold for migration times - which should be less then 1 epoch (30s)

completed round actual (with cache), took  36.853138672s
  • These high times are unexpected given that the migration for network upgrade is very light.
  • Resource usage when running the migration on this node is low, its consuming around 10GiB.
@ZenGround0
Copy link
Contributor

All right here's a high level summary of time spent. For more precise understanding of the code corresponding to each log line check out the timing branch here: master...spike/timing-migration
Timing without cache

./lotus-shed migrate-state --repo=/mnt/nvmeraid0/daemon/ 23 bafy2bzacedgsnt7hkqwisxxfcndalrf4claltpyel6k6hixqjyhikg53puq6u
2024-07-03T19:03:27.924Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to initialize migration: 461.255µs
2024-07-03T19:03:49.019Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to run migration: 21.095439755s
2024-07-03T19:03:49.019Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to run f090 migration: 44.545µs
2024-07-03T19:03:51.295Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to flush actorsOut: 2.275956678s
completed round actual (without cache), took  24.891692657s

Timing with cache

2024-07-03T19:03:52.941Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to initialize migration: 342.139µs
2024-07-03T19:04:24.281Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to run migration: 31.339779092s
2024-07-03T19:04:24.281Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to run f090 migration: 34.505µs
2024-07-03T19:04:26.604Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to flush actorsOut: 2.323670911s
completed premigration, took  35.326578544s
2024-07-03T19:04:28.152Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to initialize migration: 250.044µs
2024-07-03T19:04:49.162Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to run migration: 21.009924062s
2024-07-03T19:04:49.162Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to run f090 migration: 29.727µs
2024-07-03T19:04:51.468Z	WARN	fil-consensus	filcns/upgrades.go:2743	time to flush actorsOut: 2.305785299s
completed round actual (with cache), took  25.048051787s

@ZenGround0
Copy link
Contributor

ZenGround0 commented Jul 3, 2024

It's clear the main issue is the baseline migration of code cids. It's unclear why this is slower than expected. Also memory usage seems about half of what I'd expect (8 bytes for actor cid * ~3_000_000 actors ~24 GiB). Some hypotheses are -- 1) caching isn't as good as it should/could be 2) we're hitting some internal inflection point in go map performance vs size since there are more actors in the state tree.

At this point I'm suspicious we're not going to be able to make this significantly faster by tomorrow. But I'll keep looking and run a pprof trace.

@ZenGround0
Copy link
Contributor

nv23-migration-mainnet.zip
Im not great at parsing pprof diagrams but heres both cached and uncached. I think I'll try to isolate out only cached

@ZenGround0
Copy link
Contributor

From what I can see it looks like go GC is the expensive thing

@ZenGround0
Copy link
Contributor

ZenGround0 commented Jul 3, 2024

Watching htop I see a max memory of 15 GB and it looks like the cache is taking 11GB this seems off by 2 so I'll look into measuring cache size.

--edit--
Measured cache size, it grows to > 3M the same number as migration jobs, nothing is going wrong there though I wonder why memory overhead is so low?

--edit 2--
I never figured this out, maybe some cids are identity cids? maybe some state head cids are the same and go reuses internally?

@ZenGround0
Copy link
Contributor

ZenGround0 commented Jul 3, 2024

Analysis

Here is a reference to the testing time of the nv21 migration: https://filecoinproject.slack.com/archives/CP50PPW2X/p1693544595593359

This shows migration time with cache ~20s. @rjan90 shared a lower hardware spec at ~37s with cache above but on a good hardware spec machine we see ~24s with cache for the nv23 migration.

Note that between the time of that dry run and now the network has grown by ~640,000 actors or about 25% from its ~2500000 actor size at the time. Under the assumptions that the ~20s nv21 migration with cache was 1) bottlenecked by cache reads and not redoing actual migration work 2) the nv21 dry run was on good hardware (I remember it being run last year and Im confident about this) then our 24s migration with cache time lines up very well with the model of a linear increase in time for each actor. This is a model we should expect since the top level migration traverses all actors. Therefore the slowdown is expected.

Solutions

Smarter use of GC

As shown in the pprof profile above the CPU bottleneck for running this migration is all in GC and memory management. This seems suboptimal because the migration did not take up much memory (~11 GB for cache ~15 high water mark). Node operators have a lot of memory so if we could figure out how to trade off more memory for less time this would be ideal. I.e. turn off GC completely for data alloced from migration go routines.

My initial experiments setting GOGC = 400 and GOGC = off did not improve migration time and I have reached the limits of my understanding of go GC to help solve this. But in principle it appears we could engineer a more memory intense trade off.

Top level HAMT diff

To get sublinear we have to stop traversing every actor. Fortunately this is possible by caching nodes in the top level HAMT and skipping re-migration of the subtrees with no changes. I expect this to speed things up significantly in practice. However it is a big departure from the migration model that we've used for around 4 years so we'll want to proceed carefully. IMO we should not try to do this without significantly slipping the nv23 deadline

What should we do

The migration times are not ideal but probably safe to run on mainnet. Our two options are

  • postpone the release until we get lower migration time (~1-2 weeks of engineering work)
  • accept 25 - 40s migration times

@ZenGround0
Copy link
Contributor

ZenGround0 commented Jul 3, 2024

FYI @Stebalien I'm sure you have thoughts about both enabling more aggressive memory use and HAMT diffing solutions and possibly other ideas not yet considered.

@ZenGround0 ZenGround0 removed their assignment Jul 4, 2024
@rjan90 rjan90 changed the title Optizmize NV23-migration Optizmize base migrations Jul 4, 2024
@rvagg
Copy link
Member

rvagg commented Jul 8, 2024

@ZenGround0 can you explain what the migration is doing by touching pretty much all the actors? I was running it with some debug printing and seeing it touch ~6M of them and I'm not sure why. I understand that the actors code CIDs have changed, but why are we going through apparently all of the on-chain actors here: https://github.com/filecoin-project/go-state-types/blob/master/migration/util.go#L212

@ZenGround0
Copy link
Contributor

but why are we going through apparently all of the on-chain actors here: https://github.com/filecoin-project/go-state-types/blob/master/migration/util.go#L212

Ok nice this is just code that should not be used for code cid migrations. The run down is this code was written long pre-FVM phase I when all actors we migrated needed state migrations.

I was running it with some debug printing and seeing it touch ~6M of them and I'm not sure why.

I think you are seeing 6M cache hits? If so this seems to make sense given the unnecessary migration of all actor heads. 3M actors + 3M state heads.

This is a great find that I missed. It should give us 2x performance which puts us at parity with Forest which makes the world make more sense. I'm going to work on a PR for this today in case its not too late to get it in.

@ZenGround0
Copy link
Contributor

ZenGround0 commented Jul 9, 2024

Ok furthermore here's an interesting corallary.
-- We don't do any caching of top level actors --

The cachedMigrator around the code cid migrator is purely bad. It costs us an unnecessary cache hit on the state head AND this it does no caching of the migrated actor.

Furthermore I think the only way to correctly cache top level actors is at the level of HAMT nodes since actors are not stored on disk as single objects so we don't save on disk writes by caching them. The correct way to do this is with HAMT diffing mentioned above. Anything else is more work for less payoff.

The only speedup that premigrations give us for a base migration are in writing HAMT nodes to disk already so that the second time around badger doesn't need to actually write it can just noop when we give it a top level actor HAMT node that didnt change.

@ZenGround0
Copy link
Contributor

ZenGround0 commented Jul 9, 2024

ok -- pretty modest gains, idk if its worth rereleasing:

Before change:

completed premigration, took  34.522901313s
completed round actual (with cache), took  24.012397491s

After change

completed premigration, took  29.261960129s
completed round actual (with cache), took  20.52595699s

15% improvement on the part that matters

This makes sense -- we halved 10s of time accessing go maps and theres still 15s of time reading HAMT from disk that we have to pay. This also makes me wonder how much better the actual migration is in a lotus node with a hot actor HAMT cache from syncing the chain -- probably not a lot since few actors actually change at the head.

@BigLep
Copy link
Member

BigLep commented Jul 9, 2024

@ZenGround0 : will you put up a PR for this work so it's there for the future, even if it's not leveraged for nv23 (which is TBD, but I assume we'd include it if we're going to do an additional release anyways)

@ZenGround0
Copy link
Contributor

#289

@rvagg
Copy link
Member

rvagg commented Aug 2, 2024

I've been spending a little bit of time looking at this, measuring, tweaking and trying different things. If you strip away a lot of the multi-worker complexity so you don't have the synchronisation overhead and can see where the rest of the time is spent, it ends up looking something like this: 35% iterating through existing actors tree (around 45% of this is just badger loading blocks), 15% performing mutation and building up a new actors tree, 10% flushing that actors tree to disk (badger again), and the remaining 40% is spent on dealing with gc concerns because we're doing so many small allocations in the process of serialisation and deserialisation.

@rvagg
Copy link
Member

rvagg commented Aug 5, 2024

@ZenGround0 can you expand on how you see this working?

Fortunately this is possible by caching nodes in the top level HAMT and skipping re-migration of the subtrees with no changes.

From what I understand, we need to migrate the entire actors HAMT of >3M entries because all entries need to be updated because they all get a new Code. So we can't avoid creating an entirely new HAMT.

The only way I see diffing helping with this is with a pre-migration. Where we perform the full rewrite at one tipset some time earlier than the real migration. Then during the real migration we diff the actors tree between the premigration and current epoch and only mutate the entries new HAMT we made at premigration that have changed since then.

Is this what you had in mind, or something else?

@ZenGround0
Copy link
Contributor

The only way I see diffing helping with this is with a pre-migration
Is this what you had in mind, or something else?

Yes this allows us to make the migration cheap during the part that matters. As you are saying there is no avoiding the pre migration taking a longer time without resorting to more drastic measures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ⌨️ In Progress
Development

No branches or pull requests

5 participants