ssh [email protected]
ssh -X [email protected] # for gui
ssh -p 4422 [email protected] # connecting from outside iitkgp
- 40GB storage and 50GB hard limit
- has backup
- use it to store important files, outputs, logs, etc.
- don't store datasets here. don't submit jobs from here.
- 2TB storage
- no backup
- use it to store datasets, code, etc.
- submit jobs from here
- export job outputs to
if needed
Easy just follow the instructions here
mkdir -p ~/miniconda3
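wget <installer_url> -O ~/miniconda3/miniconda.sh # <installer_url>: the Linux installer link from the instructions
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh # remove only the installer script, not the whole directory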
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Since we require sudo to use the default package manager, yum, we will install packages to our home directory and add the binaries to our path.
- Create directories to store the installed packages and the downloaded .rpm files
mkdir -p ~/centos # for installed packages
mkdir -p ~/rpm # for downloading .rpm files
- Add the following to your .bashrc or .zshrc
export PATH="$HOME/centos/usr/sbin:$HOME/centos/usr/bin:$HOME/centos/bin:$PATH"
export MANPATH="$HOME/centos/usr/share/man:$MANPATH"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$HOME/centos/usr/lib:$HOME/centos/usr/lib64"
- now download the .rpm using
yumdownloader --destdir ~/rpm --resolve <package_name>
and install using
rpm2cpio <package_name>.rpm | cpio -D ~/centos -idmv
- you can use the script to install all the packages in the rpm directory
- or you can run
python3 <package_name>
to install a single package. find the python script here
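As a rough sketch of what installing everything in the rpm directory can look like, the helper below loops over the downloaded files and unpacks each one into the prefix. The function name `install_rpms` is hypothetical and this is not the script referred to above; it only assumes the `~/rpm` and `~/centos` layout from these notes. It uses `cd` in a subshell instead of `cpio -D`, since older cpio versions may not have that flag.

```shell
# Hypothetical helper: unpack every .rpm in a directory into a prefix.
# Defaults follow the ~/rpm and ~/centos layout used in these notes.
install_rpms() {
    rpm_dir="${1:-$HOME/rpm}"
    prefix="${2:-$HOME/centos}"
    mkdir -p "$prefix" || return 1
    count=0
    for rpm in "$rpm_dir"/*.rpm; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$rpm" ] || continue
        echo "unpacking $rpm"
        # rpm2cpio reads from the current directory, cpio extracts inside the prefix.
        rpm2cpio "$rpm" | (cd "$prefix" && cpio -idm --quiet) || return 1
        count=$((count + 1))
    done
    echo "installed $count package(s) into $prefix"
}
```

Run `install_rpms` with no arguments for the default directories, or `install_rpms <rpm_dir> <prefix>` to use other locations.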
- Use
module avail
to see all available modules
- Use
module load <module_name>
to load a module
- Latest cuda version installed is 11.7, so don't just
pip install torch
. You'll have to install a torch build that matches the cuda version. Use
module load compiler/cuda/11.7
in your job script before submitting the job on gpu nodes. see:
e.g. for cuda 11.7:
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
# or use pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
- Find the available modules here (as of Jan 2024)
You can use tmux with sessions or
srun --overlap --pty --jobid <jobid> /bin/bash
sattach <jobnum>.<num> # change the num to 0, 1, ...
# example: sattach 1167246.0
- First download and extract the required package
wget --no-check-certificate
zcat < install-tl-unx.tar.gz | tar xf -
cd install-tl-*
- Create appropriate directories for the installation
mkdir -p ~/centos/usr/texlive/2024
- Run the installer, but change the directories
perl ./install-tl
# follow the instructions on the terminal to first change the installation directory from /usr/.. to ~/centos/usr...
# Then return to main menu and continue installation
- Add the installed binary location (
) to your PATH
pip3 install -U 'mujoco-py<2.2,>=2.1' numpy scipy quaternion numpy-quaternion mujoco
mkdir ~/.mujoco && cd ~/.mujoco
tar -xf mujoco210-linux-x86_64.tar.gz
rm mujoco210-linux-x86_64.tar.gz
- add the following to your .bashrc or .zshrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
- Install any missing dependencies using the script
- sbatch <job_script>
to submit a job
- squeue
to see jobs
- scancel <job_id>
to cancel a job
- sinfo
to see nodes
- sinfo -s
to see nodes in a table
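For reference, a minimal job script could look like the sketch below. It writes a file called `job.sh` to the current directory. The job name, the environment name `myenv`, and `train.py` are placeholders, not anything defined on the cluster. The `gpu` partition and the cuda module are the ones mentioned in these notes, and the conda path assumes the `~/miniconda3` install location.

```shell
# Write a minimal GPU job script; names marked "hypothetical" are placeholders.
cat > job.sh <<'EOF'
#!/bin/bash
# hypothetical job name
#SBATCH --job-name=demo
# partition with the gpu nodes
#SBATCH --partition=gpu
# number of GPUs to request
#SBATCH --gres=gpu:1
# wall-clock limit
#SBATCH --time=01:00:00
# %j expands to the job ID
#SBATCH --output=slurm-%j.out

module load compiler/cuda/11.7
# make `conda activate` available in a non-interactive shell
source ~/miniconda3/etc/profile.d/conda.sh
# hypothetical environment name
conda activate myenv
# hypothetical training script
python train.py
EOF
```

Submit it with `sbatch job.sh` and check on it with `squeue`.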
- make sure your environment has
- submit an interactive bash job by running
srun -p gpu --time=<H>:<MM>:<SS> --gres=gpu:<num_gpus> --pty bash
- activate your environment
- (optional) use
to (detachably) multiplex the shell
- run
hostname -i
and note down your gpu node's IP, say <ip> (if you don't know it already)
- run
jupyter notebook --port XXXX --no-browser
- copy one of the full links (after "Jupyter Server is running at:"), e.g. http://localhost:<PORT>/tree?token=<TOKEN>
- many ports are blocked, so note down which port (<PORT> above) the jupyter server is actually listening on
- on your local machine, in a new shell, make a tunnel by running
ssh -t -t <USER> -L localhost:<PORT>:localhost:<PORT> ssh <USER>@<ip> -L localhost:<PORT>:localhost:<PORT>
- open the link you copied earlier in a browser on your local machine
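Since the tunnel command repeats the same port four times, a small helper that prints it can save typos. This is only a sketch: `jupyter_tunnel_cmd` is a hypothetical name, and it assumes the first hop is given as user@login-host and the second as user@gpu-node-ip. It only prints the command, it does not open any connection.

```shell
# Hypothetical helper: print the two-hop tunnel command for a given setup.
# $1 = login target (user@login-host), $2 = gpu node target (user@ip), $3 = port.
jupyter_tunnel_cmd() {
    login="$1"
    node="$2"
    port="$3"
    # the same local forward is applied on both hops
    fwd="-L localhost:$port:localhost:$port"
    echo "ssh -t -t $login $fwd ssh $node $fwd"
}
```

Call it as `jupyter_tunnel_cmd <login_target> <node_target> <PORT>`, then copy the printed command into a new shell on your local machine.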
- GPU nodes do not have access to the internet
- Set wandb to offline mode using
export WANDB_MODE=offline # in the shell
os.environ["WANDB_MODE"] = "offline" # in jupyter or inside a script (needs import os; set it before wandb.init)
wandb.init( ..., mode="offline") # or pass the mode directly
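Offline runs stay on disk until they are uploaded, so they can be synced later from a machine that does have internet access (assumed here to be the login node). A minimal sketch, assuming the default `./wandb` output directory with its `offline-run-*` folders; `sync_offline_runs` is a hypothetical helper name, while `wandb sync` is the real CLI command.

```shell
# Hypothetical helper: upload every offline run found under a wandb directory.
# Set DRY_RUN=1 to only print the commands instead of running them.
sync_offline_runs() {
    wandb_dir="${1:-./wandb}"
    for run in "$wandb_dir"/offline-run-*; do
        # Skip the literal pattern when there are no offline runs.
        [ -d "$run" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "wandb sync $run"
        else
            wandb sync "$run"
        fi
    done
}
```

Try `DRY_RUN=1 sync_offline_runs` first to see which runs would be uploaded.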
kinda incomplete i'll update it as i learn more :p