
[Feature]: Distribute the Server and Agent on two PCs #260

Closed
ll7 opened this issue Apr 3, 2024 · 25 comments · Fixed by #266
Comments


ll7 commented Apr 3, 2024

Description

  • Automate the distribution of the Server and the Agent over two PCs.

Definition of Done

No response


ll7 commented May 13, 2024

We tried to use Docker Swarm: https://docs.docker.com/engine/swarm/swarm-tutorial/#open-protocols-and-ports-between-the-hosts

However, we were unable to open the required ports on our machines. We tried it with ufw and iptables; neither worked.
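For reference, the swarm ports from the linked tutorial could be opened with ufw roughly like this (a sketch of what we attempted, not a verified fix):

# Docker Swarm ports per the official tutorial
sudo ufw allow 2377/tcp   # cluster management traffic
sudo ufw allow 7946/tcp   # node-to-node communication
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp   # overlay network (VXLAN) traffic
sudo ufw reload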

ll7 closed this as completed May 13, 2024

ll7 commented May 13, 2024

Port status can be checked with sudo ufw status verbose and sudo nmap -p- localhost.

ll7 reopened this May 13, 2024

ll7 commented Jun 11, 2024

If you want to start a Docker Swarm, Docker cannot run in rootless mode. Instead you have to use:
sudo docker swarm init --advertise-addr <your-ip-address>

So far, our concept expects the agent to run on the swarm manager PC, because we include local files via volume mounts.

As soon as you run docker swarm init, you are presented with a join command that has to be executed on another PC to add it as a node to your swarm.

To deploy a docker-compose.yml with the specified services, use sudo docker stack deploy -c docker-compose.yml <your-stack-name>.
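Putting these commands together, the intended two-PC flow looks roughly like this (IP address, token, and stack name are placeholders):

# On the manager PC (which also runs the agent):
sudo docker swarm init --advertise-addr <manager-ip>
# Copy the printed join command and run it on the second PC:
sudo docker swarm join --token <worker-token> <manager-ip>:2377
# Back on the manager, deploy the stack:
sudo docker stack deploy -c docker-compose.yml <your-stack-name>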


ll7 commented Jun 11, 2024

If you want to deploy a service on a specific node, the easiest way is to label that node and constrain the deployment in the docker-compose.yml.

docker node update --label-add manager=true <your-node-id-or-name>

services:
  agent:
    build:
      context: ../
      dockerfile: build/docker/agent/Dockerfile
      args:
        - USER_UID=${DOCKER_HOST_UNIX_UID:-1000}
        - USER_GID=${DOCKER_HOST_UNIX_GID:-1000}
    image: my_custom_agent_image:latest
    deploy:
      placement:
        constraints:
          - node.labels.manager == true
    init: true
    tty: true

Everything from the deploy key onward is new.
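A quick way to check that the label actually landed on the intended node before deploying (a sketch; the node name is a placeholder):

sudo docker node ls
sudo docker node inspect <your-node-id-or-name> --format '{{ .Spec.Labels }}'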


ll7 commented Jun 11, 2024

  • How do we mount local files if we deploy on any node?
  • Our "self-built" images throw the error that we do not provide an image name.
    • No such image: <our-image-name-from-docker-compose.yml>
  • So far, we haven't seen any GUI output. We currently guess that this is because we now deploy as the root user and not as the rootless Docker user. Therefore, we might have to adapt the xhost +local:docker command that enables X11 access and thus window management.


ll7 commented Jun 17, 2024

server rendering offscreen:

So far, the first attempt with SDL did not work.


ll7 commented Jun 17, 2024

We tried to create a shared volume for the paf23 folder and the x11 folder.

sudo docker volume create --driver local --opt type=none --opt device=/home/Documents/paf23 --opt o=bind --sharing all paf23

sudo docker volume create --driver local --opt type=none --opt device=/tmp/.x11-unix --opt o=bind --sharing all x11

The docker-compose.yml is modified to use the volume names, and the volumes are declared separately at the end of the compose file.

services:
  carla-simulator:
    command: /bin/bash CarlaUE4.sh -quality-level=High -world-port=2000 -resx=800 -resy=600 -nosound -carla-settings="/home/carla/CarlaUE4/Config/CustomCarlaSettings.ini"
    image: ghcr.io/una-auxme/paf23:leaderboard-2.0
    init: true
    deploy:
      resources:
        limits:
          memory: 16G
    expose:
      - 2000
      - 2001
      - 2002
    environment:
      - XDG_RUNTIME_DIR=/tmp/runtime-carla
    networks:
      - carla
    volumes:
      - x11:/tmp/.X11-unix

volumes:
  x11:
  paf23:
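As a quick sanity check that the bind-backed named volumes point to the intended host paths (assuming the volume names above):

sudo docker volume inspect paf23 --format '{{ .Options.device }}'
sudo docker volume inspect x11 --format '{{ .Options.device }}'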


ll7 commented Jun 17, 2024

A docker node stays in a docker swarm after a PC restart.


ll7 commented Jun 17, 2024

We are thinking about a manual solution: using SSH to start the CARLA server on a remote PC.


ll7 commented Jun 17, 2024

Our latest error: sh: 1: xdg-user-dir: not found

Either a package is missing in the CARLA server image, or we have an issue with the xhost configuration.

sudo apt install xdg-user-dirs

xhost +192.168.1.100
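To figure out which of the two it is, one quick check is whether the binary exists in the image at all (a sketch):

# No output / non-zero exit code means the package is missing in the image rather than an xhost problem:
docker run --rm ghcr.io/una-auxme/paf23:leaderboard-2.0 which xdg-user-dir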


ll7 commented Jun 17, 2024

We currently use https://www.portainer.io/ to get an overview of our docker images in a local web browser.

ll7 linked a pull request Jun 18, 2024 that will close this issue

ll7 commented Jun 18, 2024

Docker PAF23 Swarm 2024-06-17

#260

  • Docker Swarm needs root privileges, because it modifies the network. It is not possible to run Docker Swarm with rootless Docker.

  • docker swarm init creates a local docker swarm

  • docker swarm leave --force leaves the swarm as the swarm manager and shuts the swarm down.

  • docker swarm join-token <worker/manager> prints a join command to join the existing swarm as a manager or as a worker.

  • I don't know if each manager needs to have a worker to execute any commands.

    • Error response from daemon: This node is already part of a swarm. suggests that each init creates a manager, but the manager is also a worker.
  • If you want to enable file sharing, you need to run a docker container

  • Created a new docker-compose file with only the simulator.

  • Ran the new docker-compose file as a normal user.

luttkule@imech156-u:~/git/paf23$ docker compose -f build/docker-compose.swarm.yml up
WARN[0000] /home/luttkule/git/paf23/build/docker-compose.swarm.yml: `version` is obsolete 
[+] Running 1/0
 ✔ Container build-carla-simulator-1  Created                                                                             0.0s 
Attaching to carla-simulator-1
carla-simulator-1  | sh: 1: xdg-user-dir: not found
carla-simulator-1 exited with code 1

The following suggests that the error carla-simulator-1 | sh: 1: xdg-user-dir: not found is always present. However, the service starts and keeps running.

luttkule@imech156-u:~/git/paf23$ b5 run
b5 1.4.1
Found project path (/home/luttkule/git/paf23)
Found Taskfile (Taskfile)
Found config (config.yml, config.local.yml)
Config files ending in ".yml" are deprecated, please use ".yaml" instead
Executing task run

non-network local connections being added to access control list
WARN[0000] /home/luttkule/git/paf23/build/docker-compose.yml: `version` is obsolete 
WARN[0000] /home/luttkule/git/paf23/build/docker-compose.nvidia.yml: `version` is obsolete 
[+] Running 6/6
 ✔ Container paf23-flake8-1           Recreated                                                                           0.1s 
 ✔ Container paf23-mdlint-1           Created                                                                             0.0s 
 ✔ Container paf23-roscore-1          Recreated                                                                           0.2s 
 ✔ Container paf23-carla-simulator-1  Recreated                                                                           0.2s 
 ✔ Container paf23-agent-1            Recreated                                                                           0.2s 
 ✔ Container paf23-comlipy-1          Recreated                                                                           0.1s 
Attaching to agent-1, carla-simulator-1, comlipy-1, flake8-1, mdlint-1, roscore-1
comlipy-1          | ⧗    input: .
comlipy-1          | ⚠    scope may not be empty [scope-empty]
comlipy-1          | ✖    subject may not be empty [subject-empty]
comlipy-1          | ✖    type may not be empty [type-empty]
comlipy-1          | 
comlipy-1          | ✖    found 2 problems, 1 warnings
comlipy-1          | ⓘ    Help: https://gitlab.com/slashplus-build/comlipy
comlipy-1 exited with code 1
carla-simulator-1  | sh: 1: xdg-user-dir: not found
flake8-1 exited with code 0
carla-simulator-1  | 4.26.2-0+++UE4+Release-4.26 522 0
carla-simulator-1  | Disabling core dumps.
mdlint-1 exited with code 0
agent-1            | 
agent-1            | ========= Preparing RouteScenario_0 (repetition 0) =========
agent-1            | > Loading the world

However, if I run the server in isolation, the simulator exits.

luttkule@imech156-u:~/git/paf23$ docker compose -f build/docker-compose.swarm.yml up
[+] Running 1/0
 ✔ Container build-carla-simulator-1  Created                                                                             0.0s 
Attaching to carla-simulator-1
carla-simulator-1  | sh: 1: xdg-user-dir: not found
carla-simulator-1 exited with code 1

Using https://carla.readthedocs.io/en/latest/build_docker/ to run the simplest Docker image possible.

https://carla.readthedocs.io/en/latest/adv_rendering_options/#off-screen-mode

docker run --privileged --gpus all --net=host -e DISPLAY=$DISPLAY carlasim/carla:latest /bin/bash ./CarlaUE4.sh does successfully run the simulator.

The following file is successfully executed with docker compose -f build/docker-compose.carla-server.yaml up

services:
  carla-simulator:
    image: ghcr.io/una-auxme/paf23:leaderboard-2.0
    command: /bin/bash ./CarlaUE4.sh -quality-level=Low -resx=800 -resy=600 -nosound -world-port=2000
    environment:
      - DISPLAY=${DISPLAY}
    privileged: true
    network_mode: host
    deploy:
      resources:
        reservations:
          devices:
          - capabilities: ["gpu"]

Now I will create a swarm and try to launch the script there.

docker stack deploy -c build/docker-compose.carla-server.yaml carla_stack

This fails with:

services.carla-simulator.deploy.resources.reservations Additional property devices is not allowed

This https://stackoverflow.com/questions/72029582/docker-compose-returns-error-about-property-devices-when-trying-to-enable-gpu/72691651#72691651 brought me here:
https://gist.github.com/RafaelWO/290b764e88933b0c0769b6d2394fcad2

luttkule@imech156-u:~/git/paf23$ docker info
Client: Docker Engine - Community
 Version:    26.1.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 19
  Running: 0
  Paused: 0
  Stopped: 19
 Images: 34
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: ot72phbz2fb5kfah74wks6ddn
  Is Manager: true
  ClusterID: v4ik2hj8hmnvh9eh9aw4qiifc
  Managers: 1
  Nodes: 1
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 137.250.121.22
  Manager Addresses:
   137.250.121.22:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d2d58213f83a351ca8f528a95fbd145f5654e957
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-35-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 62.69GiB
 Name: imech156-u
 ID: 9607a169-1a3f-41ad-907c-cfb96adc63b2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Solutions to Enable Swarm GPU Support

Both solutions need to follow these steps first:

  1. Install nvidia-container-runtime. Follow the steps here. Takes <5 minutes.
  2. Update /etc/docker/daemon.json to use nvidia as the default runtime.
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
  3. Restart the docker daemon on each node with sudo service docker restart. Confirm the default runtime is nvidia with docker info.
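A one-liner that should confirm the switch after the restart (assuming the daemon picked up the new daemon.json):

docker info --format '{{ .DefaultRuntime }}'   # expected output: nvidia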

Successful update of the runtime

luttkule@imech156-u:~/git/paf23$ docker info
Client: Docker Engine - Community
 Version:    26.1.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 19
  Running: 1
  Paused: 0
  Stopped: 18
 Images: 34
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: ot72phbz2fb5kfah74wks6ddn
  Is Manager: true
  ClusterID: v4ik2hj8hmnvh9eh9aw4qiifc
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 137.250.121.22
  Manager Addresses:
   137.250.121.22:2377
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: d2d58213f83a351ca8f528a95fbd145f5654e957
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-35-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 62.69GiB
 Name: imech156-u
 ID: 9607a169-1a3f-41ad-907c-cfb96adc63b2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false


ll7 commented Jun 18, 2024

Update your /etc/docker/daemon.json

On each node!

Get your GPU-UUID with nvidia-smi -a.

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=<GPU-UUID>"
    ]
}
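Besides scanning the full nvidia-smi -a output, the UUID can also be queried directly (a small sketch, not required by the gist):

nvidia-smi --query-gpu=uuid --format=csv,noheader   # one UUID per GPU, paste into daemon.json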

Start Service

sudo docker service create --replicas 1 \
  --cap-add all \
  --network host \
  --name carla-server \
  --generic-resource "NVIDIA-GPU=0" \
  -e DISPLAY=$DISPLAY \
  -e XDG_RUNTIME_DIR=/tmp/runtime-carla \
  carlasim/carla:latest \
  /bin/bash ./CarlaUE4.sh
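To see whether the task was actually scheduled and what it printed (a sketch, assuming the service name above):

sudo docker service ps carla-server      # shows the task state and the node it was placed on
sudo docker service logs -f carla-server # follows the container output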


ll7 commented Jun 18, 2024

Cannot find a compatible vulkan device or driver. Try updating your video driver to a more recent version and make sure your video card supports Vulkan.


ll7 commented Jun 18, 2024

We were not able to start a swarm service with the Docker image directly. However, we were able to launch a service that ran a docker run command to start the CARLA server.

sudo docker service create \
  --name carla-server \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock,ro \
  docker \
  docker run --privileged --gpus all --net=host -e DISPLAY=$DISPLAY ghcr.io/una-auxme/paf23:leaderboard-2.0 /bin/bash ./CarlaUE4.sh

Based on the following search result: https://serverfault.com/a/1089792
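Note that with this workaround the CARLA container is started by a plain docker run on the node, outside of swarm's control, so removing the service does not stop it. A possible manual cleanup (sketch):

# On the node that runs the inner container:
sudo docker ps --filter ancestor=ghcr.io/una-auxme/paf23:leaderboard-2.0
sudo docker rm -f <container-id>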


ll7 commented Jun 18, 2024

To test the local connection, we pip-installed the CARLA Python API, upgraded pip, installed pygame and numpy, and then we were able to connect to the CARLA client/world and load a new world.

(.venv) luttkule@auxme-imech039:~/git/carla-conncection-test$ python
Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import carla
>>> client = carla.Client('localhost', 2000)
>>> world = client.get_world()
WARNING: Version mismatch detected: You are trying to connect to a simulator that might be incompatible with this API 
WARNING: Client API version     = 0.9.14 
WARNING: Simulator API version  = abd96a94 
>>> client.load_world('Town05')
<carla.libcarla.World object at 0x7f1b37f728c0>
>>> 


ll7 commented Jun 18, 2024

  • Run the server on a node that is not your own PC and connect with the local Python API.
  • Connect from an outside container.
  • Does it make a difference if the container is launched as a docker service or with docker run?
  • Rewrite the CLI service in a docker compose file.
  • Add Portainer documentation.
  • Why does it work with a second docker process and not with a docker-swarm command?


ll7 commented Jun 18, 2024

sudo docker run -d -p 8000:8000 -p 9443:9443 --name portainer \
  --restart=always \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v portainer_data:/data \
  portainer/portainer-ce:latest


ll7 commented Jun 20, 2024

When you get the following error in your docker service:

sh: 1: xdg-user-dir: not found
No protocol specified
error: XDG_RUNTIME_DIR not set in the environment.
No protocol specified
error: XDG_RUNTIME_DIR not set in the environment.
No protocol specified
error: XDG_RUNTIME_DIR not set in the environment.

Remember to use:

sudo xhost +local:
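Running xhost without arguments afterwards lists the current access-control entries, which is a quick way to confirm the change (output format may vary):

xhost   # should include an entry such as LOCAL: after the command above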


ll7 commented Jun 20, 2024

Using the CARLA Python API from a second PC works without major issues; changing the client argument from 'localhost' to the server host's IP address is sufficient. Only a slight Python API version mismatch is reported.


ll7 commented Jun 20, 2024

We used a new Docker image based on 'ubuntu:focal', upgraded pip, and installed carla as a pip package. We were able to replicate the connection.
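A rough sketch of that replication (the CARLA pip version and server IP are assumptions/placeholders):

docker run -it --rm --network host ubuntu:focal bash
# inside the container:
apt-get update && apt-get install -y python3-pip
pip3 install --upgrade pip
pip3 install carla==0.9.14
python3 -c 'import carla; print(carla.Client("<server-ip>", 2000).get_server_version())'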


ll7 commented Jun 20, 2024

Next: try to use SSH to launch CarlaUE4.sh natively and rewrite the compose file to connect to the SSH PC.
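The SSH variant could look roughly like this (user, host, and CARLA install path are placeholders):

# Launch the server natively and headless on the remote PC:
ssh <user>@<server-pc> '~/CARLA_0.9.14/CarlaUE4.sh -RenderOffScreen -world-port=2000 -nosound'
# The local compose file then only needs the agent, pointed at <server-pc>:2000.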


ll7 commented Jun 27, 2024

./CarlaUE4.sh -RenderOffScreen


ll7 commented Jun 28, 2024

b5 uses config files:

  • build/config.yml
  • build/config.local.yml

This is used to add the GPU support defined in build/docker-compose.nvidia.yml whenever a GPU was available during the install process.

This is likely defined in the b5 Taskfile (build/Taskfile) for the install task.

paf23/build/Taskfile

Lines 38 to 59 in a890f26

task:shell() {
    container="$1"
    command="$2"
    additionalArguments="${@:3}"
    docker:container_run "${container:-agent}" "${command:-/bin/bash}" ${additionalArguments:-}
}
##########################################
# Project setup / maintenance
##########################################
task:install() {
    task:install:git_hooks
    #task:gitconfig:copy
    install:gpu-support
    docker:install
}
install:gpu-support() {
    # check if docker-nvidia is installed, to make the project also executable on
    # systems without nvidia GPU.
    if [ -z "$(command -v docker-nvidia)" ]

install:gpu-support() checks for docker-nvidia and runs the next task nvidia:enable.

paf23/build/Taskfile

Lines 75 to 79 in a890f26

task:nvidia:enable() {
    # Writes the content of templates/config.nvidia.yml.jinja2 to config.local.yml
    # This file tells b5 to read docker-compose.nvidia.yml in addition to docker-compose.yml
    template:render --overwrite ask-if-older templates/config.nvidia.yml.jinja2 config.local.yml
}

This renders build/templates/config.nvidia.yml.jinja2, a b5-specific config file that overrides which compose files are loaded, so that the existing docker-compose.yml is extended with docker-compose.nvidia.yml to add GPU support:
build/docker-compose.nvidia.yml

version: "3"
# This file should contain all nvidia GPU specific configuration.
# Only loaded of specified in config.local.yml
services:
carla-simulator:
deploy:
resources:
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
environment:
- DISPLAY
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=all
agent:
deploy:
resources:
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
environment:
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=all
