# clusterRescomp.md


## SSH to clusterRescompX

On OS X you need to install and open XQuartz.

Then log in to the server with X11 forwarding enabled.
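For example, assuming your username and the rescomp1 login node (the hostname is taken from the mount examples further down):

```bash
$ ssh -X [email protected]
```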

Go ahead and open any GUI application you need (e.g., `gedit textfile`).

## Load Anaconda

Before executing or submitting a task, you have to load the modules you will need:

```bash
module use -a /mgmt/modules/eb/modules/all  # add the path where the Anaconda module is located
module load Anaconda3/5.1.0
```

Recommended: put the two lines above in your `.bashrc` (`nano ~/.bashrc`).
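One way to do that from the shell (this simply appends the two lines):

```bash
$ echo 'module use -a /mgmt/modules/eb/modules/all' >> ~/.bashrc
$ echo 'module load Anaconda3/5.1.0' >> ~/.bashrc
```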

## Create your virtual environment

Before installing any Python package, it is recommended to create an environment:

```bash
$ conda create --name myenv
```

You can also create a new environment with all your required packages from a `.yml` file:

```bash
$ conda env create -f my_packages.yml
```
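A minimal sketch of what such a file might contain (the environment name and package list are illustrative), written here from the shell for convenience:

```bash
# Write an example environment file; edit the package list to suit.
cat > my_packages.yml <<'EOF'
name: myenv
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy
  - scipy
EOF
```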

To activate the environment, just type:

```bash
$ source activate myenv
```

If you forgot any packages in your environment, you can add new ones with:

```bash
$ conda install package_name
```

More information about Anaconda environments can be found in the conda documentation.

## Example of a script file to send to the cluster queue

```bash
#!/bin/bash

#$ -P rittscher.prjb -q gpu9.q   # project and queue
#$ -l gpu=1                      # request one GPU (required on GPU queues)

module use -a /mgmt/modules/eb/modules/all
module load Anaconda3/5.1.0
source activate pytorch90-env

python -c "import torch; print('N GPU: {}'.format(torch.cuda.device_count()))"

echo "Finished at: $(date)"
exit 0
```
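To submit it and check that it entered the queue (the script name is illustrative; `qsub` and `qstat` are covered in more detail below):

```bash
$ qsub myGpuScript.sh
$ qstat -u $USER
```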

## Mount the group data directory using ssh (on Mac)

In brief, install osxfuse and sshfs using Homebrew. You can then mount the directory by typing something like:

```bash
$ sudo sshfs -o allow_other,defer_permissions [email protected]:/well/rittscher/users/ /Volumes/rescomp1
```

Note: on Linux/Ubuntu, `defer_permissions` has to be replaced with `default_permissions`.
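To unmount when you are done (macOS; the mount point matches the example above):

```bash
$ sudo umount /Volumes/rescomp1
```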

## Copy data to the group data directory using ssh directly (Linux/Mac)

```bash
$ scp -r fullPathofYourLocalDirectory [email protected]:/well/rittscher/users/yourAccountName
```

or

```bash
$ rsync -aP fullPathofYourLocalDirectory [email protected]:/well/rittscher/users/yourAccountName
```
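To copy in the other direction (cluster to local), swap the source and destination; the `data` subdirectory here is illustrative:

```bash
$ scp -r [email protected]:/well/rittscher/users/yourAccountName/data fullPathofYourLocalDirectory
```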

## Start an interactive session on a node

```bash
$ qlogin -P rittscher.prjb -q short.qb -pe shmem 1
```

(The project and queue suffixes must match: `prj*` goes with `short.q*`, where `*` is `a`, `b`, or `c`.)

For the GPU clusters you have to specify the number of GPUs, or you will not be allowed to start the session:

```bash
$ qlogin -P rittscher.prjb -q gpu8.q -pe shmem 1 -l gpu=1
```

## Start a Jupyter notebook on an interactive server

Start a remote session using tunneling:

```bash
$ ssh -L 8080:localhost:8080 [email protected]
$ jupyter notebook --no-browser --port=8080
```

Jupyter should print a link with an access token. You can then copy and paste the link into your local browser and run your notebooks.

Sometimes the port might already be in use; in that case, change the port number and start again.
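For example, with an arbitrary free port such as 8081 (both ends of the tunnel must use the new number):

```bash
$ ssh -L 8081:localhost:8081 [email protected]
$ jupyter notebook --no-browser --port=8081
```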

You can also use Jupyter on the GPU nodes by adding a second tunnel between rescomp and your node. For example, to run a Jupyter notebook on port 8888 of compG008 and make it accessible on port 8080 of rescomp, you would use the following tunneling:

```bash
[LOCAL_COMPUTER]$ ssh -L 8080:localhost:8080 [email protected]
[RESCOMP_LOGIN_NODE]$ ssh -L 8080:localhost:8888 compG008
[compG008]$ jupyter notebook --no-browser --port=8888
```

## Run your script on the cluster

A few of the most commonly used commands:

```bash
$ qsub myScript.sh
$ qsub -l h_vmem=1G,h_rt=01:50:20 -N testCluster myScript.sh   # memory limit, runtime limit, job name
```
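You can also tell SGE where to write the job's stdout and stderr (standard `-o`/`-e` flags; the `logs/` directory is illustrative and must already exist):

```bash
$ qsub -o logs/ -e logs/ -N testCluster myScript.sh
```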

To see your jobs in the queue, or all jobs:

```bash
$ qstat -u $USER
$ qstat -s p      # pending jobs (use -s r for running jobs)
$ qsum -u $USER   # compact summary; use -h for help
```
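To keep an eye on your jobs without retyping the command, one option is `watch` (assuming it is available on the login node):

```bash
$ watch -n 60 'qstat -u $USER'
```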

See the full queue status, hiding empty queues:

```bash
$ qstat -f -ne
```

Kill your jobs:

```bash
$ qdel $jobID
$ qselect -u <username> | xargs qdel   # kill all of your jobs
```

Check/monitor state of execution hosts:

```bash
$ qhost -q     # hosts with their queue status
$ qconf -sql   # list all configured queues
$ qload -u $USER
$ qload -nh -v
```

Assigning your job to a specific node (TODO: needs to be checked!):

```bash
$ qsub -q gpu.q@compG002 -N testGPU myScript.sh        # use qhost -q to check the possibilities
$ qsub -q himem.qh@compH000 -N testHighMem myScript.sh
```

## Check the software already available on the server before installing your own

```bash
$ ls /apps/well/
$ ls /mgmt/modules/eb/modules
$ module avail
```
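`module avail` also accepts a name to narrow the listing, which is handy here (standard environment-modules behaviour):

```bash
$ module avail Anaconda
```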

## Check GPU type and capacities

```bash
$ nvidia-smi
$ python -c "import torch; print(torch.version.cuda)"
```
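To also print the GPU model from Python (assumes PyTorch is installed in the active environment and a GPU is allocated):

```bash
$ python -c "import torch; print(torch.cuda.get_device_name(0))"
```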