ZKWAS-296 (#7)
* init param ftp test

* add ftp client

* try lftp dl in command

* use wget

* try with depends on

* fix indent

* basic healthcheck ftp

* update check

* dev

* rm depends upon for prover

* separate ftp and main services

* try staggered depends upon

* no verbose wget

* update readme

* fix log time

* huge page setting in multi node

* update readme

* readme update

* use our explorer server params files and change name

* ZKWAS-303 (#8)

* start prover-node service script

* fix syntax

* Update README.md

* add dockerignore

* update release hash

* update hash

---------

Co-authored-by: Yimin Yu <[email protected]>
rhaoio and yymone authored Jun 29, 2024
1 parent 0730fef commit 3d8cf05
Showing 6 changed files with 145 additions and 31 deletions.
1 change: 1 addition & 0 deletions .dockerignore
@@ -0,0 +1 @@
./mongo
4 changes: 2 additions & 2 deletions Dockerfile
@@ -2,7 +2,7 @@ FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
# Install required packages and setup ssh access
RUN apt-get update && apt-get install -y --no-install-recommends openssh-server sudo cmake curl build-essential git && rm -rf /var/lib/apt/lists/* \
RUN apt-get update && apt-get install -y --no-install-recommends openssh-server sudo cmake curl build-essential git wget && rm -rf /var/lib/apt/lists/* \
&& sudo apt update -y && sudo apt install -y apache2-utils \
&& mkdir /var/run/sshd \
&& /etc/init.d/ssh start \
@@ -19,7 +19,7 @@ RUN git config --global url.https://github.com/.insteadOf [email protected]:

RUN git clone https://github.com/DelphinusLab/prover-node-release && \
cd prover-node-release && \
git checkout be216b3fdb562a7e7d5982c6262768e6c977015c
git checkout a298d2feffd8296cc98b97caf314424942ea1a43

WORKDIR /home/zkwasm/prover-node-release

96 changes: 77 additions & 19 deletions README.md
@@ -15,6 +15,7 @@ This is the docker container for the prover node. This container is responsible
- [HugePages Configuration](#hugepages-configuration)
- [GPU Configuration](#gpu-configuration)
- [Multiple Nodes on the same machine](#multiple-nodes-on-the-same-machine)
- [Upgrading Prover Node](#upgrading-prover-node)

## Environment

@@ -82,7 +83,7 @@ The image is currently built with

- Ubuntu 22.04
- CUDA 12.2
- prover-node-release #be216b3fdb562a7e7d5982c6262768e6c977015c
- prover-node-release #a298d2feffd8296cc98b97caf314424942ea1a43

**Important!**
The versions should not be changed unless the prover node is updated. The compiled prover node binary is sensitive to the CUDA version and the Ubuntu version.
@@ -197,6 +198,16 @@ services:
If using host network mode, the port mapping will be ignored, and the port will be the default `27017`.
Specify the port by adding `--port <PORT>` to the `command` field in the `docker-compose.yml` file for the mongodb service.

**Important:** If you change the DB port under `network_mode: host`, you must also update the healthcheck to use the same port.

```yaml
services:
mongodb:
command: --config /data/configdb/mongod.conf --port 8099
healthcheck:
test: echo 'db.runCommand("ping").ok' | mongosh localhost:8099/test --quiet
```
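
Because the port appears in both `command` and `healthcheck`, it is easy to change one and forget the other. The sketch below prints the healthcheck command for a given port so the two can be kept in sync. The function name `mongo_healthcheck_cmd` is hypothetical and not part of this repository; the command string itself is the one used in the healthcheck above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print the healthcheck
# command for a given MongoDB port so it can be pasted into docker-compose.yml
# or run by hand inside the mongodb container.
mongo_healthcheck_cmd() {
  local port="${1:-27017}"  # 27017 is the MongoDB default port
  printf '%s\n' "echo 'db.runCommand(\"ping\").ok' | mongosh localhost:${port}/test --quiet"
}

# Example: the command for a node whose mongod listens on port 8099.
mongo_healthcheck_cmd 8099
```

Running the printed command inside the `mongodb` container is one way to confirm that the server answers on the configured port before relying on the healthcheck.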

##### Logging and log rotation

`mongo`'s logging feature is very basic and cannot clean up old logs, so we use Docker's logging feature instead.
@@ -215,32 +226,27 @@ Finally, we use `host` `network_mode`, this is because our server code refers to

</details>

## Start

Make sure you have built the image with `bash build_image.sh`.
Start all services at once with the following command. Because they all run in the same terminal, their output may clog up the window, so you may prefer to run some services in detached mode.

`docker compose up`
## Quick Start

To start multiple containers on a machine, use the following command
The Params FTP server must be running before the prover node is started. The prover node must copy the parameters from the FTP server to its own volume to operate correctly.

`docker compose -p <node> up` where `node` is the unique name of the container/project you would like to start.
### Params FTP Server

Ensure the Docker Compose file specifies GPUs for each container.
Start the FTP server with `docker compose -f ftp-docker-compose.yml up params-ftp`.

### Starting individual services
The default port is `21` and the default user is `ftpuser` with password `ftppassword`. The ports used for file transfer are `30000-30009`.

It may be cleaner to start services individually. You can start each in a new terminal window, or in the background.
### Prover Node

To start each service in the background, use the following command
Make sure you have built the image with `bash build_image.sh`.

`docker compose start <service>`
Once the Params FTP server is running, you can start the prover node.

To start an attached service, use the following command:
Start all services at once with the following command. Because they all run in the same terminal, their output may clog up the window, so you may prefer to run some services in detached mode.

`docker compose up <service>`
`docker compose up` runs the base services in this order: `mongodb`, `prover-dry-run-service`, `prover-node`.

The `mongodb` service must be started first, followed by the `prover-node` and `prover-dry-run-service` services.
## Multiple Prover Nodes

### Multiple Nodes on the same machine

@@ -272,6 +278,12 @@ Ensure the MongoDB instance is unique for each node. This is done by modifying t
- Modify the `mongodb` service's `container_name` field to a unique value such as `zkwasm-mongodb-2`.
- Set the correct port to bind to the host machine. Please refer to the MongoDB configuration section for more information.
- If using host network mode, the port does not need to be specified under services, but may be specified as part of the `command` field, e.g. `--port 8099`.
- If supplying a custom port with `network_mode: host`, ensure the port is unique for each node. Ensure the healthcheck is updated to use the correct port.
```yaml
command: --config /data/configdb/mongod.conf --port XXXX
healthcheck:
test: echo 'db.runCommand("ping").ok' | mongosh localhost:XXXX/test --quiet
```

Ensure the `dry_run_config.json` file is updated with the correct MongoDB URI for each node.

@@ -283,14 +295,60 @@ Private key should be UNIQUE for each node.

Ensure the `dry_run_config.json` file is updated with the correct server URL and MongoDB URI for each node.

#### HugePages Configuration

Running multiple nodes requires HugePages to be expanded to accommodate the memory requirements of each node.

Each prover node requires roughly 15000 huge pages, so ensure `vm.nr_hugepages` is set to the correct value on the **HOST MACHINE**.

Use `sudo sysctl -w vm.nr_hugepages=30000` for two nodes, `45000` for three nodes, and so on.
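
The value scales linearly with the number of nodes, so it can be computed rather than hard-coded. A minimal sketch, assuming the figure of roughly 15000 huge pages per node stated above; the function name `required_hugepages` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): compute the
# vm.nr_hugepages value needed for N prover nodes on one host.
# Assumes ~15000 huge pages per node, as stated in this README.
HUGEPAGES_PER_NODE=15000

required_hugepages() {
  local nodes="$1"
  echo $(( nodes * HUGEPAGES_PER_NODE ))
}

# Print (without running) the sysctl command for three nodes.
echo "sudo sysctl -w vm.nr_hugepages=$(required_hugepages 3)"
```

The script only prints the command, so it can be reviewed before anything is changed on the host.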

#### Docker volume and container names

Ensure the docker volumes are unique for each node. This is done by modifying the `docker-compose.yml` file for each node.

The simplest method is to start the containers with a different project name from other directories/containers.

`docker compose -p <node> up -d`
`docker compose -p <node> up` starts the services in this order: `mongodb`, `prover-dry-run-service`, `prover-node`.

Here `node` is the custom project name for the services you would like to start, e.g. `node-2`. This keeps the containers and volumes of each node separate from one another.

Follow the output of the container with `docker logs -f <node>_<service>` (Full name of container, which can be found with `docker ps` or `docker container ls`)
### Logs

To follow the logs of a specific container, first navigate to the directory containing the corresponding `docker-compose.yml` file.

Then run `docker compose logs -f <service-name>`

Here `service-name` is the name of the service as defined in the Docker Compose file (`mongodb`, `prover-node`, etc.).

## Upgrading Prover Node

Upgrading the prover node requires rebuilding the docker image with the new prover node binary, and clearing previously stored data.

Stop all containers with `docker compose down`.

If any containers are still running, stop them manually: list them with `docker container ls`, then run `docker stop <container-name-or-id>`.

Prune the containers with `docker container prune`.

### Pull Latest Changes

Pull the latest changes from the repository with `git pull`.

You may need to stash your changes if you have modified the `docker-compose.yml` file, and apply them again afterwards.

Similarly, if `prover_config.json` or `dry_run_config.json` have been modified, ensure the changes are applied again.

### Delete Volume

Find the correct volume you would like to delete with `docker volume ls`.

Delete the prover-node workspace volume with `docker volume rm <volume_name>`. By default, the volume name is `prover-node-docker_workspace-volume`.
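
Docker Compose prefixes named volumes with the project name, so nodes started with `-p <node>` have a different volume name from the default. The sketch below derives the name for a given project. The function name `workspace_volume_name` is hypothetical, and the default project name `prover-node-docker` is taken from the default volume name quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): derive the workspace
# volume name for a Compose project. Compose names volumes
# "<project>_<volume>", and this repository's volume is "workspace-volume".
workspace_volume_name() {
  local project="${1:-prover-node-docker}"  # default project name assumed from this README
  echo "${project}_workspace-volume"
}

# Print (without running) the removal command for a node started with `-p node-2`.
echo "docker volume rm $(workspace_volume_name node-2)"
```

Check the printed name against the output of `docker volume ls` before deleting anything.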

### Rebuild Docker Image

Remove the old Docker image: check the image name with `docker image ls`, then run `docker image rm zkwasm:latest`.

Rebuild the docker image with `bash build_image.sh`.
20 changes: 20 additions & 0 deletions _start_prover-node-service.sh
@@ -0,0 +1,20 @@
nvidia-smi && \
# Huge Pages
echo 'Checking Huge Pages configuration:'
cat /proc/meminfo | grep Huge
ls -lh /dev/hugepages

# Check HugePages_Free
hugepages_free=$(awk '/HugePages_Free/ {print $2}' /proc/meminfo)
echo "HugePages_Free: $hugepages_free"

# Treat a missing value as 0 so the check fails cleanly instead of erroring
if [ "${hugepages_free:-0}" -lt 15000 ]; then
echo "Error: HugePages_Free ($hugepages_free) is less than 15000. Please make sure HugePages is configured correctly on the host machine. Requires 15000 HugePages configured per node."
exit 1
fi

# Download param files from local FTP server
wget -r -nH -nv --cut-dirs=1 --no-parent --user=ftpuser --password=ftppassword ftp://localhost/params/ -P /home/zkwasm/prover-node-release/workspace/static/ && \
time=$(date +%Y-%m-%d-%H-%M-%S) && \
CUDA_VISIBLE_DEVICES=0 RUST_LOG=info RUST_BACKTRACE=1 ./target/release/zkwasm-playground --config prover_config.json -w workspace --dryrunconfig dry_run_config.json -p \
2>&1 | rotatelogs -e -n 10 logs/prover/prover_${time}.log 100M
36 changes: 26 additions & 10 deletions docker-compose.yml
@@ -1,8 +1,7 @@
version: "3.8"

services:
mongodb:
image: mongo:latest
attach: false
network_mode: "host"
# ports:
# - "27017:27017"
@@ -15,6 +14,12 @@ services:
max-size: "10m"
max-file: "5"
command: --config /data/configdb/mongod.conf
healthcheck:
test: echo 'db.runCommand("ping").ok' | mongosh localhost:27017/test --quiet
start_period: 5s
interval: 30s
timeout: 10s
retries: 3
container_name: zkwasm-mongodb
restart: always
prover-dry-run-service:
@@ -25,6 +30,15 @@ services:
build:
context: .
dockerfile: Dockerfile
depends_on:
mongodb:
condition: service_healthy
healthcheck:
test: pgrep -f zkwasm-playground
start_period: 15s
interval: 15s
timeout: 5s
retries: 3
volumes:
- ./prover_config.json:/home/zkwasm/prover-node-release/prover_config.json
- ./dry_run_config.json:/home/zkwasm/prover-node-release/dry_run_config.json
@@ -52,6 +66,9 @@ services:
image: zkwasm:latest
runtime: nvidia
network_mode: "host"
depends_on:
prover-dry-run-service:
condition: service_healthy
deploy:
resources:
reservations:
@@ -79,21 +96,20 @@ services:
- prover-logs-volume:/home/zkwasm/prover-node-release/logs/prover
# configure huge pages for the prover
- /dev/hugepages:/dev/hugepages
# Starting script for the prover
- ./_start_prover-node-service.sh:/home/zkwasm/prover-node-release/_start_prover-node-service.sh
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "5"
environment:
- TZ=Etc/UTC
- NVM_DIR=/home/zkwasm/.nvm
- NODE_VERSION=16.19.1
- PATH=$NVM_DIR/versions/node/v$NODE_VERSION/bin:$PATH
command: bash -c "
nvidia-smi && \
time=$$(date +%Y-%m-%d-%H-%M-%S)
CUDA_VISIBLE_DEVICES=0 RUST_LOG=info RUST_BACKTRACE=1 ./target/release/zkwasm-playground --config prover_config.json -w workspace --dryrunconfig dry_run_config.json -p \
2>&1 | rotatelogs -e -n 10 logs/prover/prover_$${time}.log 100M"
command:
[
"/bin/bash",
"/home/zkwasm/prover-node-release/_start_prover-node-service.sh",
]
volumes:
workspace-volume:
prover-logs-volume:
19 changes: 19 additions & 0 deletions ftp-docker-compose.yml
@@ -0,0 +1,19 @@
services:
params-ftp:
image: zkwasm/params
network_mode: "host"
# ports:
# - "21:21"
# - "30000-30009:30000-30009"
environment:
PUBLICHOST: "localhost"
FTP_USER_NAME: ftpuser
FTP_USER_PASS: ftppassword
FTP_USER_HOME: /home/ftpuser
# ADDED_FLAGS: "-p 2121:2121 -p 30000-31000:30000-31000"
healthcheck:
# Basic health check to ensure the FTP server is running
test: "ls -l /var/run/pure-ftpd.pid"
interval: 30s
timeout: 10s
retries: 3
