Original docs here.
SSH into one of the head nodes, i.e. `mlp`, `mlp1`, or `mlp2`.
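For example (a sketch; the fully qualified hostname is an assumption, check your site's docs):

```bash
ssh ${USER}@mlp.inf.ed.ac.uk   # hypothetical FQDN for the `mlp` head node
```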
Use `sinfo` to check available nodes.
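For instance, to get one row per node with its state, CPUs, and memory:

```bash
sinfo -N -l
```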
To change into a worker node, use `srun` like so:

```bash
srun --nodelist=landonia04 --pty bash
```
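If you need a GPU in the interactive session, `--gres` works with `srun` too (a sketch, assuming GPUs are exposed as the `gpu` generic resource, as in the `sbatch` example below):

```bash
srun --nodelist=landonia04 --gres=gpu:1 --pty bash
```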
Only head nodes have internet access; worker nodes do not.
AFS home directories can be accessed from head nodes but not from worker nodes. Copy files from your AFS home directory to the cluster home directory first (whilst on a head node), then onto the worker nodes' scratch disks. Scratch space is not shared across nodes and can be accessed as `/disk/scratch` on every machine.
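A minimal staging sketch (the paths are placeholders, not fixed conventions):

```bash
# Whilst on a head node: AFS home directory -> cluster home directory
cp -r /path/to/afs/homedir/data ~/data        # substitute your AFS path

# Whilst on a worker node (e.g. inside an srun session): cluster homedir -> local scratch
mkdir -p /disk/scratch/${USER}
rsync -a ~/data/ /disk/scratch/${USER}/data/
```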
- Run `install_conda.sh` to ensure that Conda is installed and that the `lid` environment exists
- Run `source ~/.bashrc; conda activate lid`
- Install Kaldi using `install_kaldi.sh`
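Put together, the setup steps above might look like this on a head node (assuming the scripts sit in the repository root and are runnable with `bash`):

```bash
bash install_conda.sh      # installs Conda if missing and creates the `lid` environment
source ~/.bashrc           # picks up the Conda initialisation
conda activate lid
bash install_kaldi.sh      # fetches and builds Kaldi
```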
Use `sbatch` to submit a job to Slurm from a head node. See here for the complete docs.
Simple example:

```bash
sbatch \
  --nodelist=landonia[04-08] \
  --gres=gpu:2 \
  --job-name=LID \
  --mail-type=END \
  [email protected] \
  --open-mode=append \
  --output=cluster-exploration.out \
  explore.sh
```
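Here `--nodelist` asks Slurm for the named nodes, `--gres=gpu:2` requests two GPUs, `--mail-type=END` together with `--mail-user` sends an email when the job finishes, and `--open-mode=append` appends to the `--output` file rather than truncating it.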
To check the job queue use `squeue`.
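To see only your own jobs, pass your username:

```bash
squeue -u ${USER}
```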
- Ensure the environment is activated and everything is installed
- Move the data to a head node
- Run the script from a head node
```bash
(echo 'YourDicePassword' | nohup longjob -28day -c './run.sh --exp-config=conf/exp_default.conf --stage=1' &> nohup-baseline.out ) &
```
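Here `longjob` is the DICE utility for long-running commands: it reads your password from stdin and keeps the job's Kerberos/AFS credentials renewed (for up to 28 days with `-28day`), while `nohup` and the trailing `&` detach the pipeline from the shell so it survives logout, with output redirected to `nohup-baseline.out`. Replace `YourDicePassword` with your actual DICE password.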