
Deploy two hosts for benchmarking nimbus-eth1 import #194

Closed
jakubgs opened this issue Aug 20, 2024 · 21 comments
@jakubgs
Member

jakubgs commented Aug 20, 2024

The development of nimbus-eth1 is ramping up, and for that reason we will need to benchmark the process of importing the network state data, with validation, from ERA files. This process is not yet optimized, as it is in its early stages of development, which means a full import of mainnet would probably take more than a week. Despite that, we need to start measuring results in order to track progress in optimizing the import process.

This benchmarking will require two kinds of test on two hosts:

  • A long running test that lasts a week.
  • A "short" running test that lasts 24 hours.

Neither of those will finish, so both will have to be aborted; the number of blocks they manage to sync will be the measure of performance. These performance reports will have to be archived in some way; the simplest way would be to commit them to a dedicated repository. In addition to the reports gathered this way, the import process exposes a /metrics endpoint which we can scrape with Prometheus.

The two hosts can be purchased from Hetzner, since they will not be handling external connections. The storage will need to be at least 2x the size of the Mainnet ERA and ERA1 files, which is currently ~1 TB, so an additional 2 TB NVMe would suffice. Aside from that, more than 16 GB of RAM and 4 cores is enough.

Update as of 28 Oct:

The short test must begin with a template DB which contains the first 20M blocks, since measuring the import process from those blocks onward is what matters to the Nimbus team. Jacek will provide this template DB.

The long test will begin with no template DB and is also an import-only test; it usually takes around a week.

The goal is to measure time taken to complete import in both cases.

@siddarthkay
Contributor

How urgently do we need the 2 hosts?
I could either get a cheaper host from auction via Hetzner

Screenshot 2024-08-22 at 5 27 30 PM

OR

I could get a dedicated host which would be comparatively more expensive
example : https://www.hetzner.com/dedicated-rootserver/ax52/

Screenshot 2024-08-22 at 5 29 36 PM

My assumption is that the host from Auction might take longer to get compared to the dedicated one.

@siddarthkay
Contributor

After discussing with @jakubgs, I finally went ahead with the following

2 x Dedicated Server AX42
* Location: Finland, HEL1
* For Finland, HEL1, support is only available in English.
* Rescue system (English)
* 1 x Primary IPv4
* 1 x 2 TB NVMe SSD
* 8 Core CPU
* 64 GB DDR5 ECC RAM

Order Details :

Screenshot_2024-08-22_at_5 55 10_PM

Possible wait times :

Screenshot_2024-08-22_at_5 54 55_PM

@siddarthkay
Contributor

These 2 AX42 hosts have been activated by Hetzner and currently boot into the rescue system.
I'll bootstrap these hosts and add them to our inventory.

siddarthkay added a commit that referenced this issue Sep 4, 2024
This commit adds 2 hetzner AX42 hosts for eth1 benchmarking to our network.

related issue: #194
siddarthkay added a commit that referenced this issue Sep 5, 2024
This commit adds 2 hetzner AX42 hosts for eth1 benchmarking to our network.

related issue: #194
siddarthkay added a commit that referenced this issue Sep 5, 2024
This commit adds 2 hetzner AX42 hosts for eth1 benchmarking to our network.

related issue: #194
@siddarthkay
Contributor

siddarthkay commented Oct 23, 2024

I will use bench-01 for the short 24-hour test and bench-02 for the long-running 1-week test.

Next steps are as follows :

  • Get ERA1 files from nel-01.ih-eu-mda1.nimbus.mainnet and move them to bech hosts.
 [email protected]:~ % sudo du -hsc /docker/era1
 428G    /docker/era1
 428G    total
  • Get ERA files from nel-01.ih-eu-mda1.nimbus.mainnet and move them to bech hosts.
  • Get nimbus-eth1 running on bench-01.he-eu-hel1.nimbus.eth1 and bench-02.he-eu-hel1.nimbus.eth1, make sure the node is running as expected, and set nimbus_eth1_network to mainnet.
  • On bench-01, set up the template DB which contains the first 20M blocks.
  • On bench-01, set up a systemd timer to trigger an import of the state by passing --era1-dir and --era-dir, and log the time taken for the short sync to complete. The process should be terminated if it runs longer than 24 hours, and on termination we log the progress percentage of the import.
  • The terminating script on bench-01 should also replace the existing DB with the "template DB" so that when the short test is run again, it starts from the 20M-block state.
  • On bench-02, set up a systemd timer to trigger an "import" of the state by passing --era1-dir and --era-dir, and log the time taken for the long sync to complete. The process should be terminated if it runs longer than 1 week, and on termination we log the progress percentage of the import.
  • The terminating script on bench-02 should also clean up the existing DB so that when the import is run again, it starts from a clean slate.
  • Discuss the first few results with Jacek and the eth1 team on Discord.
  • Implement a process to auto-publish these results by committing them to GitHub.

@jakubgs
Member Author

jakubgs commented Oct 23, 2024

Sounds correct. Remember that the timer will have to do several things:

  1. Measure and save progress and time it took.
  2. Stop the nimbus-eth1 service.
  3. Purge already synced data.
  4. Restart the nimbus-eth1 service.

@siddarthkay
Contributor

as per @arnetheduck :

syncing = creating a state from blocks, usually sourced from the network
import = a method of syncing that reads era files instead of sourcing the blocks from the network - it's the same blocks, just in a file instead of requesting from nodes on the network

What we want to measure is the performance of turning blocks into a state; using import for this purpose eliminates the networking aspect, focusing on the block-processing component.

@siddarthkay
Contributor

siddarthkay commented Nov 9, 2024

The short benchmark was run and completed in ~12 hours:

Nov  4 23:32:38 bench-01.he-eu-hel1.nimbus.eth1 nimbus-eth1-mainnet-short-benchmark[147056]: 
INF 2024-11-04 23:32:38.837+00:00 Imported blocks                            
blockNumber=21005282 blocks=1005281 importedSlot=10223616 txs=160021843 mgas=15219979.109 
bps=24.889 tps=4328.405 mgps=376.486 avgBps=23.451 avgTps=3732.972 avgMGps=355.050 elapsed=11h54m27s

This was, however, run on a RAID 0 setup across 3 drives. RAID 0 is used precisely where enhanced performance is needed, so it inflates the results, and it also carries a risk: if any one of the 3 drives fails, the entire volume is lost. That seems too risky if we will be running benchmarks frequently, so further benchmarks will be run on devices without RAID 0.

@siddarthkay
Contributor

I made this GitHub repo to hold the CSV results exported by the short benchmark test:
https://github.com/status-im/nimbus-eth1-benchmarks

The format of the reports is still a work in progress; for now the systemd service just pushes the CSV exported by the import process to this GitHub repo.

I also made this infra role for building nimbus-eth1, cleaning up the bench-01 host, and restarting the short benchmark from a clean state:
https://github.com/status-im/infra-role-nimbus-bench-eth1

@siddarthkay
Contributor

siddarthkay commented Nov 27, 2024

Thinking further about the expected report, the following items should be covered:

  • runtime or duration of the entire benchmark
  • start block and end block
  • command used to run this benchmark
  • hardware information the benchmark was run on

The folder structure could be

short benchmark
├── 2024-11-27
│   ├── {HH:MM:SS-git-short-commit}-metrics-export.csv
│   └── {HH:MM:SS-git-short-commit}-build-environment.json
├── 2024-11-26
│   ├── {HH:MM:SS-git-short-commit}-metrics-export.csv
│   └── {HH:MM:SS-git-short-commit}-build-environment.json
└── 2024-11-25
    ├── {HH:MM:SS-git-short-commit}-metrics-export.csv
    └── {HH:MM:SS-git-short-commit}-build-environment.json

and similar for long benchmark

@jakubgs
Member Author

jakubgs commented Nov 28, 2024

I wonder if the folder structure shouldn't be 2024/11/25/{HH:MM:SS-git-short-commit}-metrics-export.csv

If we create a folder for each day, the root of the repo will become quite a long list very quickly.

@siddarthkay
Contributor

Hmm, indeed. Also, from a search standpoint, I believe the Nimbus team would be more interested in looking up the short-benchmark performance of a particular commit, so a folder per commit would not be a bad idea either; within that folder we could have various files with a timestamp identifier attached. Like this:
Like this :

short benchmark
├── {git-commit-hash}
│   ├── {ISO Timestamp}-metrics-export.csv
│   └── {ISO Timestamp}-build-environment.log
├── {git-commit-hash}
│   ├── {ISO Timestamp}-metrics-export.csv
│   └── {ISO Timestamp}-build-environment.log
└── {git-commit-hash}
    ├── {ISO Timestamp}-metrics-export.csv
    └── {ISO Timestamp}-build-environment.log

long benchmark
├── {git-commit-hash}
│   ├── {ISO Timestamp}-metrics-export.csv
│   └── {ISO Timestamp}-build-environment.log
├── {git-commit-hash}
│   ├── {ISO Timestamp}-metrics-export.csv
│   └── {ISO Timestamp}-build-environment.log
└── {git-commit-hash}
    ├── {ISO Timestamp}-metrics-export.csv
    └── {ISO Timestamp}-build-environment.log

@jakubgs
Member Author

jakubgs commented Dec 2, 2024

I think using dates in the folder structure makes for a nicer format. Using commits for folders is not great because when you run find | sort in the repo you won't get a list correctly ordered by timestamp, since commits break that ordering.
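The ordering point is easy to demonstrate: zero-padded date paths sort chronologically with a plain `find | sort`, whereas commit-hash paths sort arbitrarily. A quick illustration (the `demo` directory name is just for the example):

```shell
# Date-based paths come back chronologically from `find | sort`.
mkdir -p demo/2024/11/25 demo/2024/11/26 demo/2024/12/02
find demo -mindepth 3 -type d | sort
# demo/2024/11/25
# demo/2024/11/26
# demo/2024/12/02
```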

@arnetheduck
Member

Additional feature: when a benchmark runs, it should be compared against the previous commit using https://github.com/status-im/nimbus-eth1/blob/master/scripts/block-import-stats.py - this script compares two CSV files and outputs a comparison table as can be seen in this comment: status-im/nimbus-eth1#2413 (comment)

See also: https://github.com/status-im/nimbus-eth1/tree/master/scripts#block-import-statspy

Its output could be saved to a text file together with the other outputs.
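The comparison step might then look like this; the CSV file names and output path are placeholders, and the script path is relative to a nimbus-eth1 checkout:

```shell
# Hypothetical invocation: the script compares a baseline CSV against the
# current run's CSV and prints a comparison table, which we archive alongside
# the run's other outputs.
python3 scripts/block-import-stats.py \
  previous-run/metrics-export.csv current-run/metrics-export.csv \
  > current-run/import-stats-comparison.txt
```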

@siddarthkay
Contributor

Example of comparing current run with previous run for short benchmark :
https://github.com/status-im/nimbus-eth1-benchmarks/blob/master/short-benchmark/20241215T163401_650fec5a/build-environment.log

I will clean up all old/incomplete benchmarks from the repository now.
The repo does contain multiple reports for the same commit; they were run to test the setup of the benchmarking automation.

@siddarthkay
Contributor

siddarthkay commented Dec 27, 2024

Short benchmarking reports have been stable for a while.
A recent long benchmarking report was pushed here: status-im/nimbus-eth1-benchmarks@cd8d97c

I consider this task done, unless there are any more changes or bugs in the reports.

@siddarthkay
Contributor

siddarthkay commented Jan 2, 2025

Another requirement is redirecting the output of the block-import-stats Python script to a README.md so that it can be rendered on GitHub. This would be a per-benchmark README.

A main repo README.md is also required and could be generated like this:

cat README.tmpl > README.md
grep Time $(find . -name build-environment.log | sort -r) | sed -rn 's~./(.*)/build.*Time.*: (.*), (.*)~|[\1](\1/)|\2|\3|~p' >> README.md

Generated README.md would look like this :

# Benchmarks

Benchmarks for nimbus-eth1 .. bla bla

## Results

| Benchmark | Time | Diff |
| --- | ---: | ---: |

@jakubgs
Member Author

jakubgs commented Jan 2, 2025

grep Time $(find -name build-environment.log | sort -r) | sed -rn 's~./(.*)/build.*Time.*: (.*), (.*)~|[\1](\1/)|\2|\3|~p' > README.md

What kind of monstrosity is this? You know you can template files in bash using envsubst and plain environment variables?

@siddarthkay
Contributor

Yes, just copy-pasting what Jacek had mentioned in chat to keep track.

@arnetheduck
Member

What kind of monstrosity is this

"whatever" as long as it puts an overview table in the "top-level" readme

@siddarthkay
Contributor

The main README is generated here: https://github.com/status-im/nimbus-eth1-benchmarks/blob/master/README.md

It uses this template: https://github.com/status-im/nimbus-eth1-benchmarks/blob/master/README-TEMPLATE.md and is regenerated on each benchmark run with envsubst.

@siddarthkay
Contributor

I consider this completed.
We can open other issues to track enhancements or bug fixes if they are required.
