Move DoStateCheckpoint off critical path #15411

Merged
merged 5 commits into from
Jan 14, 2025

@msmouse msmouse commented Nov 27, 2024

Description

  1. Add struct State to represent the speculative state, utilizing the LayeredMap / MapLayer.
  2. Wrap the SMT in StateSummary to represent the speculative state tree. StateStorageUsage is removed from the SMT and put in State.
  3. The LayeredMap / MapLayer stack can read diffs between versions by following in-memory links, so the hashmaps that tracked the updates between versions are removed.
  4. Proof reads are now removed from the execution stage, so we save some CPU by reading proofs only for the writes, not the reads.
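
To illustrate item 3, here is a minimal sketch of reading the diff between two versions by walking in-memory parent links. The `Layer` type and `diff_since` function are hypothetical names invented for this illustration and are not the aptos-core LayeredMap / MapLayer API; keys and values are simplified to `String` and `u64`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// One layer of updates produced at `version`, linked to the layer it was built on.
/// Hypothetical type for illustration only; not the aptos-core `MapLayer`.
struct Layer {
    version: u64,
    updates: HashMap<String, u64>,
    parent: Option<Arc<Layer>>,
}

impl Layer {
    /// An empty base layer at version 0.
    fn root() -> Arc<Layer> {
        Arc::new(Layer { version: 0, updates: HashMap::new(), parent: None })
    }

    /// A new layer on top of `parent`, holding only the keys written at this version.
    fn child(parent: &Arc<Layer>, updates: &[(&str, u64)]) -> Arc<Layer> {
        Arc::new(Layer {
            version: parent.version + 1,
            updates: updates.iter().map(|(k, v)| (k.to_string(), *v)).collect(),
            parent: Some(Arc::clone(parent)),
        })
    }
}

/// Collects the net updates made after `base_version`, up to and including `top`,
/// by walking parent links from newest to oldest. The newest write to a key wins.
fn diff_since(top: &Arc<Layer>, base_version: u64) -> HashMap<String, u64> {
    let mut diff = HashMap::new();
    let mut cursor = Some(top);
    while let Some(layer) = cursor {
        if layer.version <= base_version {
            break;
        }
        for (key, value) in &layer.updates {
            // Keep the value from the newest layer that touched this key.
            diff.entry(key.clone()).or_insert(*value);
        }
        cursor = layer.parent.as_ref();
    }
    diff
}
```

Because each layer already links to its parent, the diff between any two versions on the same chain can be recovered on demand, which is why a separate per-version hashmap of updates is not needed in this sketch.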

Todo: move it further to a separate stage.

Throughput

  1. All single-node-performance cases have seen improvements. See the last commit.

  2. On a 100M-account DB and a 60-core t4d machine, apt-fa-transfer (with the native VM; redoing for AptosVM) improved from 32k TPS to 40k TPS:

$ for suffix in main main final
do RUST_BACKTRACE=1  PUSH_METRICS_NAMESPACE=alden-benchmark-100Mp2p ./aptos-executor-benchmark.$suffix \
--execution-threads 32 \
--block-executor-type native-vm-with-block-stm \
--transactions-per-sender 1 \
--generate-then-execute \
--block-size 8086 \
--enable-storage-sharding \
run-executor \
--enable-feature NEW_ACCOUNTS_DEFAULT_TO_FA_APT_STORE \
--enable-feature OPERATIONS_DEFAULT_TO_FA_APT_STORE \
--transaction-type apt-fa-transfer \
--transaction-weights 1 \
--module-working-set-size 1 \
--main-signer-accounts 20000 \
--additional-dst-pool-accounts 600000 \
--data-dir ~/work/data/100M \
--checkpoint-dir ~/work/data/cp \
--blocks 300
done
  3. With AptosVM, the same workload improved from 24k to 26.8k TPS.

  4. Creating a 100M-account DB takes 40 minutes, versus 48 minutes with main branch code.
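
As a quick check on the throughput figures above, the relative changes can be computed directly. This is only arithmetic on the numbers quoted in this description; `percent_change` is a helper name made up for the example.

```rust
/// Relative change from `before` to `after`, in percent.
/// Positive means an increase, negative a decrease.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the quoted figures, `percent_change(32_000.0, 40_000.0)` is a 25% TPS gain for the native VM, `percent_change(24_000.0, 26_800.0)` is roughly an 11.7% gain for AptosVM, and `percent_change(48.0, 40.0)` is roughly a 16.7% reduction in DB creation time.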

Latency

Latency in Forge didn't change much:

realistic_env_load_sweep
main: https://github.com/aptos-labs/aptos-core/actions/runs/12696454894
this: https://github.com/aptos-labs/aptos-core/actions/runs/12696477211

realistic_env_workload_sweep
main: https://github.com/aptos-labs/aptos-core/actions/runs/12696466689
this: https://github.com/aptos-labs/aptos-core/actions/runs/12696621130

In theory, a block that combines heavy execution (but not much reading) with writes to a lot of COLD state items will suffer latency-wise: we don't prefetch the proofs anymore, so the disk IOs for proof fetching no longer overlap with the execution time. At our current DB size, however, a warm DB pretty much catches everything (through RocksDB and the filesystem). Explicit caching of hot items is coming, so I'm not complicating things by adding prefetching back.
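
The trade-off above can be sketched with an idealized model. This is an assumption-laden illustration, not a measurement: it assumes prefetching overlaps perfectly with execution, that proof reads are otherwise serialized after execution, and that a single cache hit ratio scales the cold read time. All function names are hypothetical.

```rust
/// Expected proof-read time once a fraction of the reads is served from cache.
/// `cache_hit_ratio` is between 0.0 (all cold) and 1.0 (fully warm).
fn proof_io_ms(cold_io_ms: f64, cache_hit_ratio: f64) -> f64 {
    cold_io_ms * (1.0 - cache_hit_ratio)
}

/// Idealized latency when proof reads fully overlap with execution.
fn latency_with_prefetch(exec_ms: f64, io_ms: f64) -> f64 {
    exec_ms.max(io_ms)
}

/// Idealized latency when proof reads start only after execution finishes.
fn latency_without_prefetch(exec_ms: f64, io_ms: f64) -> f64 {
    exec_ms + io_ms
}

/// Extra latency paid for dropping prefetch under this model.
fn prefetch_penalty_ms(exec_ms: f64, cold_io_ms: f64, cache_hit_ratio: f64) -> f64 {
    let io_ms = proof_io_ms(cold_io_ms, cache_hit_ratio);
    latency_without_prefetch(exec_ms, io_ms) - latency_with_prefetch(exec_ms, io_ms)
}
```

Under this model the penalty equals the smaller of the execution time and the effective proof-read time, so it shrinks toward zero as the cache hit ratio approaches 1.0, which matches the warm-DB argument.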

How Has This Been Tested?

existing coverage

Key Areas to Review

Type of Change

  • New feature

Which Components or Systems Does This Change Impact?

  • Validator

trunk-io bot commented Nov 27, 2024

⏱️ 2h 56m total CI duration on this PR
Slowest 15 Jobs

Job                 Cumulative Duration  Recent Runs
rust-cargo-deny     20m                  🟩🟩🟩🟩 (+7 more)
check-dynamic-deps  17m                  🟩🟩🟩🟩🟥 (+7 more)
rust-move-tests     13m                  🟩
rust-move-tests     13m                  🟩
rust-move-tests     13m                  🟩
rust-move-tests     13m                  🟩
rust-move-tests     13m                  🟩
rust-move-tests     13m                  🟩
rust-move-tests     13m                  🟩
rust-move-tests     12m                  🟩
rust-move-tests     9m
rust-move-tests     7m
general-lints       5m                   🟩🟩🟩🟩 (+7 more)
semgrep/ci          5m                   🟩🟩🟩🟩🟩 (+7 more)
check               3m                   🟩

@msmouse msmouse force-pushed the 1125-alden-state-summary branch from 0eb8d0e to ddaade9 Compare November 28, 2024 21:58
@msmouse msmouse marked this pull request as draft November 28, 2024 22:11
@aptos-labs aptos-labs deleted a comment from graphite-app bot Nov 28, 2024
@msmouse msmouse force-pushed the 1125-alden-state-summary branch 8 times, most recently from 6c4828a to 6bc9a36 Compare December 3, 2024 03:31
@msmouse msmouse force-pushed the 1125-alden-state-summary branch 3 times, most recently from ae365d6 to 9e9964b Compare December 8, 2024 21:51
@msmouse msmouse added the CICD:run-execution-performance-test Run execution performance test label Dec 10, 2024
@msmouse msmouse force-pushed the 1125-alden-state-summary branch 4 times, most recently from 550a066 to d59dead Compare December 10, 2024 10:40
@msmouse msmouse added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Dec 10, 2024

@msmouse msmouse force-pushed the 1125-alden-state-summary branch from 644d149 to b42b2c3 Compare January 10, 2025 19:11

@msmouse msmouse force-pushed the 1125-alden-state-summary branch from b42b2c3 to 09823fe Compare January 13, 2025 17:46

@@ -0,0 +1,9 @@
# Seeds for failure cases proptest has generated in the past. It is

remove?

@msmouse msmouse force-pushed the 1125-alden-state-summary branch from 09823fe to 5633246 Compare January 14, 2025 00:32
@msmouse msmouse force-pushed the 1125-alden-state-summary branch from 5633246 to d21c2a6 Compare January 14, 2025 00:36
@msmouse msmouse enabled auto-merge (rebase) January 14, 2025 00:37

✅ Forge suite realistic_env_max_load success on d21c2a6909f603961c5d5d4045423265a8fa73ae

two traffics test: inner traffic : committed: 14975.83 txn/s, latency: 2654.34 ms, (p50: 2700 ms, p70: 2700, p90: 2900 ms, p99: 3300 ms), latency samples: 5694120
two traffics test : committed: 99.98 txn/s, latency: 1486.55 ms, (p50: 1300 ms, p70: 1400, p90: 2400 ms, p99: 3000 ms), latency samples: 1740
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.561, avg: 1.439", "ConsensusProposalToOrdered: max: 0.290, avg: 0.287", "ConsensusOrderedToCommit: max: 0.301, avg: 0.292", "ConsensusProposalToCommit: max: 0.586, avg: 0.579"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.51s no progress at version 291 (avg 0.19s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.52s no progress at version 2850515 (avg 0.52s) [limit 16].
Test Ok

✅ Forge suite compat success on 6593fb81261f25490ffddc2252a861c994234c2a ==> d21c2a6909f603961c5d5d4045423265a8fa73ae

Compatibility test results for 6593fb81261f25490ffddc2252a861c994234c2a ==> d21c2a6909f603961c5d5d4045423265a8fa73ae (PR)
1. Check liveness of validators at old version: 6593fb81261f25490ffddc2252a861c994234c2a
compatibility::simple-validator-upgrade::liveness-check : committed: 16263.42 txn/s, latency: 2119.71 ms, (p50: 1800 ms, p70: 2000, p90: 3000 ms, p99: 6900 ms), latency samples: 529240
2. Upgrading first Validator to new version: d21c2a6909f603961c5d5d4045423265a8fa73ae
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7179.69 txn/s, latency: 4155.73 ms, (p50: 4800 ms, p70: 5100, p90: 5200 ms, p99: 5300 ms), latency samples: 133820
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7215.76 txn/s, latency: 4667.96 ms, (p50: 5000 ms, p70: 5100, p90: 5200 ms, p99: 5700 ms), latency samples: 245340
3. Upgrading rest of first batch to new version: d21c2a6909f603961c5d5d4045423265a8fa73ae
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6130.56 txn/s, latency: 4950.64 ms, (p50: 5800 ms, p70: 6100, p90: 6300 ms, p99: 6500 ms), latency samples: 115340
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6174.02 txn/s, latency: 5462.99 ms, (p50: 6000 ms, p70: 6100, p90: 6300 ms, p99: 6500 ms), latency samples: 212400
4. upgrading second batch to new version: d21c2a6909f603961c5d5d4045423265a8fa73ae
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 14012.63 txn/s, latency: 2061.96 ms, (p50: 2100 ms, p70: 2500, p90: 2700 ms, p99: 2700 ms), latency samples: 239720
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 14319.12 txn/s, latency: 2243.18 ms, (p50: 2200 ms, p70: 2600, p90: 2800 ms, p99: 3100 ms), latency samples: 461540
5. check swarm health
Compatibility test for 6593fb81261f25490ffddc2252a861c994234c2a ==> d21c2a6909f603961c5d5d4045423265a8fa73ae passed
Test Ok

@msmouse msmouse merged commit 2e85700 into main Jan 14, 2025
69 of 89 checks passed
@msmouse msmouse deleted the 1125-alden-state-summary branch January 14, 2025 01:32
Labels
  • CICD:build-images: when this label is present github actions will start build+push rust images from the PR.
  • CICD:run-e2e-tests: when this label is present github actions will run all land-blocking e2e tests from the PR
  • CICD:run-execution-performance-full-test: Run execution performance test (full version)
3 participants