FEAT: Slurm Deployment For Xorbits #719

Merged Oct 31, 2023 (75 commits).

Commits
5e71c9f  support for slurm deployment (Sep 25, 2023)
fa5a831  Merge branch 'main' into main (mergify[bot], Sep 25, 2023)
019c949  support for slurm (Aprilerr, Sep 25, 2023)
a3b90b0  slurm for support (Aprilerr, Sep 25, 2023)
7130d9c  Merge branch 'main' of https://github.com/fengsxy/xorbits into main (Aprilerr, Sep 25, 2023)
527f34f  support for slurm (fengsxy, Sep 25, 2023)
e61b814  fix pre-commit (Aprilerr, Sep 25, 2023)
dc237e1  support for slurm (Aprilerr, Sep 25, 2023)
a8f0446  remove the cluster (Aprilerr, Sep 25, 2023)
b375fef  logging instead of print (Aprilerr, Sep 25, 2023)
6488731  change for default (fengsxy, Sep 26, 2023)
2536405  support for slurm pytest (fengsxy, Sep 26, 2023)
8c18a61  support for slurm pytestgi (fengsxy, Sep 26, 2023)
8961ad1  support for slurm (fengsxy, Sep 26, 2023)
c63232c  modified: slurm.sh (fengsxy, Sep 26, 2023)
391b6fa  modified: ../python/xorbits/deploy/cluster/Slurm.py (fengsxy, Sep 26, 2023)
10af0bf  Merge branch 'xorbitsai:main' into main (fengsxy, Sep 28, 2023)
10ccff2  support for slurm (fengsxy, Sep 29, 2023)
abb344f  change for workflow (fengsxy, Sep 29, 2023)
4470dcf  change for workflow (fengsxy, Sep 29, 2023)
d52c6cc  change for workflow (fengsxy, Sep 29, 2023)
a61defd  support for slurm (fengsxy, Sep 29, 2023)
9d24d65  pytest add for xorbits slurm deploy (fengsxy, Sep 29, 2023)
50e3d80  pytest add for xorbits slurm deploy (fengsxy, Sep 29, 2023)
4cddfbd  Merge branch 'main' into main (mergify[bot], Oct 8, 2023)
43c0eb0  Merge branch 'main' into main (mergify[bot], Oct 8, 2023)
af859ea  Merge branch 'main' into main (mergify[bot], Oct 9, 2023)
49e58e9  Merge branch 'main' into main (mergify[bot], Oct 9, 2023)
8ba17c9  Merge branch 'main' into main (mergify[bot], Oct 10, 2023)
19f255d  t p (fengsxy, Oct 15, 2023)
e65645f  Merge branch 'main' of github.com:fengsxy/xorbits into main (fengsxy, Oct 15, 2023)
5900cad  Delete CI/slurm.sh (fengsxy, Oct 15, 2023)
86ed567  modified: .gitignore (fengsxy, Oct 15, 2023)
1c969c6  Merge branch 'main' of github.com:fengsxy/xorbits into main (fengsxy, Oct 15, 2023)
8a69102  new file: CI/slurm.sh (fengsxy, Oct 15, 2023)
8cabb9f  t (fengsxy, Oct 15, 2023)
fb6630b  deleted: .github/workflows/cluster.yaml (fengsxy, Oct 15, 2023)
f7539ee  t (fengsxy, Oct 15, 2023)
49541bb  modified: ../python/xorbits/deploy/slurm/tests/test_slurm.py (fengsxy, Oct 15, 2023)
ecc9dcf  modified: .github/workflows/python.yaml (fengsxy, Oct 16, 2023)
c24f086  modified: doc/source/user_guide/deployment_slurm.rst (fengsxy, Oct 16, 2023)
1084199  Merge branch 'main' into main (mergify[bot], Oct 16, 2023)
64f5565  Merge branch 'main' into main (mergify[bot], Oct 16, 2023)
1684b9d  Merge branch 'main' into main (mergify[bot], Oct 17, 2023)
b4f6239  Merge branch 'main' into main (mergify[bot], Oct 19, 2023)
c7c7fb2  Merge branch 'main' into main (mergify[bot], Oct 20, 2023)
4797b4d  Update python.yaml (fengsxy, Oct 25, 2023)
da3a740  Merge branch 'main' into main (mergify[bot], Oct 26, 2023)
1a66e1e  Update python.yaml (fengsxy, Oct 30, 2023)
6fdd0ac  Update .pre-commit-config.yaml (fengsxy, Oct 30, 2023)
d2f815d  Update .pre-commit-config.yaml (fengsxy, Oct 30, 2023)
ee3b1a5  Update deployment_slurm.rst (fengsxy, Oct 30, 2023)
a26ea08  Update .gitignore (fengsxy, Oct 30, 2023)
a12d136  Update .gitignore (fengsxy, Oct 30, 2023)
0951564  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
7729af6  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
df98447  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
8f80e71  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
678a317  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
5247784  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
57b51b1  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
91419c9  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
a0d0633  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
deabf52  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
e70d203  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
940f331  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
6a931e5  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
e89053d  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
fbd4e02  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
b89d7b2  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
2bb54b3  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
d0bc341  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
52dc124  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
e390033  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
a0da962  Update deployment_slurm.rst (fengsxy, Oct 31, 2023)
Files changed
60 changes: 60 additions & 0 deletions .github/workflows/cluster.yaml
@@ -0,0 +1,60 @@
name: CI

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    defaults:
      run:
        shell: bash -l {0}
    strategy:
      fail-fast: false
      matrix:
        jobqueue: ["slurm"]

    steps:
      - name: Cancel previous runs
        uses: styfle/[email protected]
        with:
          access_token: ${{ github.token }}
      - name: Checkout source
        uses: actions/checkout@v2

      - name: Setup Empty Conda Environment with Mamba
        if: matrix.jobqueue == 'none'
        uses: conda-incubator/setup-miniconda@v2
        with:
          channels: conda-forge
          mamba-version: "*"
          activate-environment: xorbits
          auto-activate-base: false

      - name: Setup xorbits conda environment
        if: matrix.jobqueue == 'none'
        run: |
          mamba env update -f CI/requirements-wheel.txt
          mamba list

      - name: Setup Job queuing system
        if: matrix.jobqueue != 'none'
        run: |
          source CI/${{ matrix.jobqueue }}.sh
          jobqueue_before_install

      - name: Install xorbits
        run: |
          source CI/${{ matrix.jobqueue }}.sh
          jobqueue_install

      - name: Test
        run: |
          source CI/${{ matrix.jobqueue }}.sh
          jobqueue_script

      - name: Cleanup
        if: always()
        run: |
          source CI/${{ matrix.jobqueue }}.sh
          jobqueue_after_script
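
Each jobqueue_* step above sources CI/${{ matrix.jobqueue }}.sh, which for this matrix resolves to CI/slurm.sh (shown further below). As a minimal sketch, the same lifecycle can be driven by hand from the repository root, assuming Docker and docker-compose are available locally:

source CI/slurm.sh
jobqueue_before_install   # pull images and start the dockerized Slurm cluster
jobqueue_install          # pip install the xorbits checkout inside slurmctld
jobqueue_script           # run the Slurm deployment test and print its output
jobqueue_after_script     # dump sinfo/squeue/sacct for post-mortem debugging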
2 changes: 1 addition & 1 deletion .gitignore
@@ -5,7 +5,7 @@ __pycache__/

# C extensions
*.so

.pre-commit-config.yaml
# Distribution / packaging
.Python
build/
5 changes: 0 additions & 5 deletions .pre-commit-config.yaml
@@ -26,11 +26,6 @@ repos:
      additional_dependencies: [tokenize-rt==3.2.0]
      exclude: _mars
      args: [--ignore-missing-imports, --follow-imports, skip]
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.0.0 # Use the sha or tag you want to point at
    hooks:
      - id: prettier
        types_or: [html, javascript]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.5
    hooks:
45 changes: 45 additions & 0 deletions CI/slurm.sh
@@ -0,0 +1,45 @@
#!/usr/bin/env bash

function jobqueue_before_install {
    docker version
    docker-compose version

    # start slurm cluster
    cd ./CI/slurm
    docker-compose pull
    ./start-slurm.sh
    cd -

    # Set shared space permissions
    docker exec slurmctld /bin/bash -c "chmod -R 777 /shared_space"

    docker ps -a
    docker images
    show_network_interfaces
}

function show_network_interfaces {
    for c in slurmctld c1 c2; do
        echo '------------------------------------------------------------'
        echo docker container: $c
        docker exec $c python -c 'import psutil; print(psutil.net_if_addrs().keys())'
        echo '------------------------------------------------------------'
    done
}

function jobqueue_install {
    docker exec slurmctld /bin/bash -c "cd xorbits/python/; pip install -e ."
}

function jobqueue_script {
    docker exec c1 /bin/bash -c "pip install xorbits"
    docker exec c2 /bin/bash -c "pip install xorbits"
    docker exec slurmctld /bin/bash -c "python /xorbits/python/xorbits/deploy/cluster/Slurm.py"
    docker exec slurmctld /bin/bash -c "cat /shared_space/output.out"
}

function jobqueue_after_script {
    docker exec slurmctld bash -c 'sinfo'
    docker exec slurmctld bash -c 'squeue'
    docker exec slurmctld bash -c 'sacct -l'
}
2 changes: 2 additions & 0 deletions CI/slurm/Dockerfile
@@ -0,0 +1,2 @@
FROM daskdev/dask-jobqueue:slurm
RUN pip install xorbits
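
Because the compose file below declares both image: and build: . for the Slurm services, this image can be rebuilt locally under the same tag; a sketch, assuming it is run from the repository root:

cd CI/slurm && docker-compose build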
120 changes: 120 additions & 0 deletions CI/slurm/docker-compose.yml
@@ -0,0 +1,120 @@
version: "2.2"

services:
  mysql:
    image: mysql:5.7.29
    hostname: mysql
    container_name: mysql
    environment:
      MYSQL_RANDOM_ROOT_PASSWORD: "yes"
      MYSQL_DATABASE: slurm_acct_db
      MYSQL_USER: slurm
      MYSQL_PASSWORD: password
    volumes:
      - var_lib_mysql:/var/lib/mysql
    networks:
      common-network:

  slurmdbd:
    image: daskdev/dask-jobqueue:slurm
    build: .
    command: ["slurmdbd"]
    container_name: slurmdbd
    hostname: slurmdbd
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - var_log_slurm:/var/log/slurm
    expose:
      - "6819"
    depends_on:
      - mysql
    networks:
      common-network:

  slurmctld:
    image: daskdev/dask-jobqueue:slurm
    build: .
    command: ["slurmctld"]
    container_name: slurmctld
    hostname: slurmctld
    environment:
      - CI_SHARED_SPACE=/shared_space
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/data
      - var_log_slurm:/var/log/slurm
      - ../..:/xorbits
      - shared_space:/shared_space
    expose:
      - "6817"
    depends_on:
      - "slurmdbd"
    networks:
      common-network:
        ipv4_address: 10.1.1.10
    cap_add:
      - NET_ADMIN

  c1:
    image: daskdev/dask-jobqueue:slurm
    build: .
    command: ["slurmd"]
    hostname: c1
    container_name: c1
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/data
      - var_log_slurm:/var/log/slurm
      - ../..:/xorbits
      - shared_space:/shared_space
    expose:
      - "6818"
    depends_on:
      - "slurmctld"
    networks:
      common-network:
        ipv4_address: 10.1.1.11
    cap_add:
      - NET_ADMIN

  c2:
    image: daskdev/dask-jobqueue:slurm
    build: .
    command: ["slurmd"]
    hostname: c2
    container_name: c2
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/data
      - var_log_slurm:/var/log/slurm
      - ../..:/xorbits
      - shared_space:/shared_space
    expose:
      - "6818"
    depends_on:
      - "slurmctld"
    networks:
      common-network:
        ipv4_address: 10.1.1.12
    cap_add:
      - NET_ADMIN

volumes:
  etc_munge:
  etc_slurm:
  slurm_jobdir:
  var_lib_mysql:
  var_log_slurm:
  shared_space:

networks:
  common-network:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 10.1.1.0/24
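
A quick way to sanity-check the topology this file defines (one controller, two slurmd nodes, a shared /shared_space volume) is to bring it up and ask the controller; a sketch, assuming it is run from CI/slurm:

cd CI/slurm
docker-compose up -d --no-build
docker exec slurmctld sinfo                # expect nodes c1 and c2 in partition "normal"
docker exec slurmctld srun -N 2 hostname   # should print c1 and c2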
5 changes: 5 additions & 0 deletions CI/slurm/register_cluster.sh
@@ -0,0 +1,5 @@
#!/bin/bash
set -e

docker exec slurmctld bash -c "/usr/bin/sacctmgr --immediate add cluster name=linux" && \
docker-compose restart slurmdbd slurmctld
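
start-slurm.sh (below) retries this script until sacctmgr stops erroring, but it can also be run by hand once the containers are up; a sketch:

cd CI/slurm
./register_cluster.sh
docker exec slurmctld sacctmgr --noheader show cluster   # the "linux" cluster should be listed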
98 changes: 98 additions & 0 deletions CI/slurm/slurm.conf
@@ -0,0 +1,98 @@
# slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=slurmctld
ControlAddr=slurmctld
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurmd
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
SlurmdPidFile=/var/run/slurmd/slurmd.pid
ProctrackType=proctrack/linuxproc
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=300
Waittime=30
# Raised from the defaults: low values let resource-starved job steps get
# killed. Example of the failure this avoids:
#   srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
#   slurmstepd: error: *** STEP 27.0 ON c1 CANCELLED AT 2023-09-25T06:30:54 ***

# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
AccountingStorageLoc=slurm_acct_db
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=c[1-2] RealMemory=4096 CPUs=2 State=UNKNOWN
#
# PARTITIONS
PartitionName=normal Default=yes Nodes=c[1-2] Priority=50 DefMemPerCPU=2048 Shared=NO MaxNodes=2 MaxTime=5-00:00:00 DefaultTime=5-00:00:00 State=UP
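
The NodeName and PartitionName lines above cap each node at 2 CPUs and 4096 MB, with 2048 MB per CPU by default, so the largest job the cluster accepts spans both nodes with four tasks. A sketch of probing those limits from the controller (illustrative; assumes the containers are running):

docker exec slurmctld scontrol show nodes                          # verify CPUs=2, RealMemory=4096 per node
docker exec slurmctld srun -N 2 -n 4 --mem-per-cpu=2048 hostname   # largest allocation that fits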
17 changes: 17 additions & 0 deletions CI/slurm/start-slurm.sh
@@ -0,0 +1,17 @@
#!/bin/bash

docker-compose up -d --no-build

while [ `./register_cluster.sh 2>&1 | grep "sacctmgr: error" | wc -l` -ne 0 ]
do
echo "Waiting for SLURM cluster to become ready";
sleep 2
done
echo "SLURM properly configured"

# On some clusters the login node does not have the same network interface as
# the compute nodes. The next three lines allow testing this edge case by
# adding separate interfaces on the worker and scheduler nodes.
docker exec slurmctld ip addr add 10.1.1.20/24 dev eth0 label eth0:scheduler
docker exec c1 ip addr add 10.1.1.21/24 dev eth0 label eth0:worker
docker exec c2 ip addr add 10.1.1.22/24 dev eth0 label eth0:worker
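
To confirm the aliased addresses took effect, the interfaces can be listed from inside each container; a sketch:

docker exec slurmctld ip addr show dev eth0   # expect 10.1.1.20 labelled eth0:scheduler
docker exec c1 ip addr show dev eth0          # expect 10.1.1.21 labelled eth0:worker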