
Guide to using Grid5000 (Chroma Team @ Inria)

Grid'5000 is a large-scale and versatile testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing, including Cloud, HPC and Big Data (see Grid5000's website).

This guide focuses on using Grid5000 as an alternative for computation when hardware such as GPUs is not available locally.

First step: Get an account.

There are two main types of accounts:

  • Academics from France: those currently working on any research project in France, or academics abroad collaborating with academics in France (the latter should ask their French collaborators for details).
  • Open Access Program: people without such a collaboration can request a lower-priority account. Interested private companies need to contact Grid5000's executive committee members.

For this step you will need to provide your SSH public key. If you have not generated one, follow this tutorial.
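Alternatively, here is a minimal sketch of generating a key pair on your local machine (the default key path ~/.ssh/id_rsa is an assumption; any key type Grid5000 accepts works):

user@localPC:~$ ssh-keygen -t rsa -b 4096 #generate a key pair, accepting the default path
user@localPC:~$ cat ~/.ssh/id_rsa.pub #print the public key to paste into the account form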

Second step: Choose a cluster to work in

This is a list of all the hardware available on Grid5000. Check it to see which cluster best suits your needs. At the time this guide was written, these were the clusters with CUDA-capable GPUs:

| Site  | Cluster   | Available GPUs                                  | Queue      |
|-------|-----------|-------------------------------------------------|------------|
| Lille | chifflet  | Nvidia GTX 1080Ti x 2                           | default    |
| Lille | chifflot  | Nvidia Tesla P100 x 2 and Nvidia Tesla V100 x 2 | default    |
| Lyon  | orion     | Nvidia Tesla M2075                              | default    |
| Nancy | graphique | Nvidia Titan Black x 2 and Nvidia GTX 980 x 2   | production |
| Nancy | grele     | Nvidia GTX 1080Ti x 2                           | production |
| Nancy | grimani   | Nvidia Tesla K40M                               | default    |

Once you have chosen a cluster, log in to your account via ssh [email protected], then ssh to the site hosting the cluster you want to work on, e.g. ssh nancy, ssh lille or ssh lyon. You should now be able to access your home directory on any of Grid5000's clusters.
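Optionally, you can reach a frontend in a single hop by adding a jump-host alias to the ~/.ssh/config on your local machine; a sketch, assuming your Grid5000 username is user (the g5k alias name is just a convention):

Host g5k
  User user
  Hostname access.grid5000.fr
  ForwardAgent no

Host *.g5k
  User user
  ProxyCommand ssh g5k -W "$(basename %h .g5k):%p"
  ForwardAgent no

With this in place, ssh nancy.g5k connects you straight to the Nancy frontend through the access machine.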

Third step: Set up your work environment

Note about storage: the default storage per user on Grid5000 is 25 GB. If you need more, you can request a bigger quota on the Grid5000 API user storage tab (login needed).

Install Miniconda

The steps listed here are based on this tutorial (login needed) by the user Ibada. This guide only covers setting up with Miniconda, since it is lighter than the full Anaconda distribution. All commands are executed from the user's home directory.

First, download Miniconda for the version of Python you will be working with. If you are working with Python 2.7, change the Miniconda version from 3 to 2.

For Python 3.7:

user@site:~$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

Then, run it:

user@site:~$ bash Miniconda3-latest-Linux-x86_64.sh

Here, the installation guide will prompt you for the path where Miniconda will be installed. It will also ask whether conda should be initialized when bash starts (the default is no).

Finally, copy the .bashrc script available in this repository and replace user with your username on lines 119, 123, 124 and 127 (see the sketch after the example below). It contains many useful features to personalize your bash experience; more importantly, if you kept the default option of conda being disabled at startup, sourcing this script lets you activate the conda environment.

user@site:~$ source .bashrc
(base) user@site:~$ #Conda base environment now active
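For reference, the user-specific lines in that .bashrc are most likely in the conda initialize block that the installer can append; a sketch, assuming Miniconda was installed at /home/user/miniconda3:

# >>> conda initialize >>>
__conda_setup="$('/home/user/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/user/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/user/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/user/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<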

Create a virtual environment and install the needed libraries

With Miniconda set up, you can now create environments for your projects via:

conda create --name env

and install conda-supported libraries and packages from Anaconda Cloud, Conda Forge or any other channel you want.
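For example, a sketch of creating the pytorch_env environment used later in this guide and installing PyTorch from the pytorch channel (the Python and cudatoolkit versions here are assumptions; pick ones matching your code and the cluster's drivers):

(base) user@site:~$ conda create --name pytorch_env python=3.7
(base) user@site:~$ conda activate pytorch_env
(pytorch_env) user@site:~$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch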

Fourth step: Ask for compute time on the GPU clusters

You can check the availability status of each cluster here (login needed) to see whether your desired hardware is busy.

These bash scripts facilitate the process of asking for jobs. Both are wrappers around the oarsub command and use the default queue; check the hardware table above to see which queue the GPUs you want to use belong to. A sketch of the underlying call follows the list:

  • ask_for_job_fixed_time.sh has a fixed job time and can be used to quickly test whether the environment recognizes the cluster's GPUs.
  • ask_for_job_input_time.sh takes the job time as an argument in hh:mm:ss format, e.g. when you have an estimated training time for a network.
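For reference, a minimal sketch of the kind of oarsub call these scripts wrap, using the OAR property shown in the job log below (the exact options live in the repository's scripts):

#!/bin/bash
# Sketch of ask_for_job_input_time.sh: interactive job on one GPU node
# $1 = walltime in hh:mm:ss
echo " Remember to source bashrc!"
echo " Remember to activate the conda env!"
oarsub -I -l "host=1,walltime=$1" -p "GPU <> 'NO'"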

For example, using the ask_for_job_input_time.sh:

user@flille:~$ bash ask_for_job_scripts/ask_for_job_input_time.sh 00:05:00
 Remember to source bashrc!
 Remember to activate the conda env!
[ADMISSION RULE] Modify resource description with type constraints
[ADMISSION_RULE] Resources properties : \{'property' => 'type = \'default\'','resources' => [{'resource' => 'host','value' => '1'}]}
[ADMISSION RULE] Job properties : (GPU <> 'NO') AND maintenance = 'NO'
Generate a job key...
OAR_JOB_ID=1681786
Interactive mode: waiting...
Starting...

Connect to OAR job 1681786 via the node chifflet-6.lille.grid5000.fr
user@chifflet-6:~$ source .bashrc 
(base) user@chifflet-6:~$ conda activate pytorch_env
(pytorch_env) user@chifflet-6:~$ python pytorch_probe_gpus.py
GeForce GTX 1080 Ti detected on device 0
GeForce GTX 1080 Ti detected on device 1
(pytorch_env) user@chifflet-6:~$ #GPUs detected!

Once you are in a job, you can use the available hardware on that specific cluster for your computations.
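The pytorch_probe_gpus.py script used in the session above ships with this repository; a minimal sketch of such a probe, using PyTorch's CUDA API to produce the output shown:

import torch

# Print every CUDA device PyTorch can see on the reserved node
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print("{} detected on device {}".format(torch.cuda.get_device_name(i), i))
else:
    print("No CUDA-capable GPU detected")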

Ask for a specific GPU

Since some clusters have more than one type of GPU, the ask_for_job_input_time_and_gpu.sh script lets you ask for a specific GPU on a cluster. It takes the time in hh:mm:ss format as its first argument and the desired GPU name in 'quotes' as its second. The names can be consulted under OAR Properties on each site's Monika page at the G5000 status page (login needed). A sketch of the underlying call is shown below.
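For reference, a minimal sketch of the kind of call this script wraps, matching the properties in the job log below (-q production is an assumption for clusters in the production queue, per the table above):

#!/bin/bash
# Sketch of ask_for_job_input_time_and_gpu.sh: $1 = walltime, $2 = GPU name
echo " Asking for job with $2 "
oarsub -I -q production -l "host=1,walltime=$1" -p "GPU = '$2'"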

user@fnancy:~/usingGrid5000$ bash ask_for_job_scripts/ask_for_job_input_time_and_gpu.sh 00:05:00 'GTX 980'
 Remember to source bashrc!
 Remember to activate the conda env!
 Asking for job with GTX 980 
[ADMISSION RULE] Modify resource description with type constraints
[ADMISSION RULE] Assign max_walltime property for production resources selection
[ADMISSION_RULE] Resources properties : \{'resources' => [{'value' => '1','resource' => 'host'}],'property' => '((type = \'default\') AND production = \'YES\') AND (max_walltime >= 300 OR max_walltime <= 0)'}
[ADMISSION RULE] Job properties : (GPU = 'GTX 980') AND maintenance = 'NO'
Generate a job key...
OAR_JOB_ID=1930362
Interactive mode: waiting...
Starting...

Connect to OAR job 1930362 via the node graphique-5.nancy.grid5000.fr
user@graphique-5:~/usingGrid5000$ source .bashrc 
(base) user@graphique-5:~/usingGrid5000$ conda activate pytorch-env
(pytorch-env) user@graphique-5:~/usingGrid5000$ python pytorch_probe_gpus.py 
GeForce GTX 980 detected on device 0
GeForce GTX 980 detected on device 1
(pytorch-env) user@graphique-5:~/usingGrid5000$ #Got wanted GPUs!

Other useful commands

To transfer a file from the Grid5000 machines to your local PC via secure copy:

user@localPC:~$ scp [email protected]:site/path_from_home/file.py /home/user/directory/file.py #for single files
user@localPC:~$ scp -r [email protected]:site/path_from_home/directory /home/user/directory/ #for directories

To transfer a file from your PC to a cluster via secure copy:

user@localPC:~$ scp /home/user/directory/file.py [email protected]:site/path_from_home/file.py #for single files
user@localPC:~$ scp -r /home/user/directory/ [email protected]:site/path_from_home/directory #for directories
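For large directories, rsync is a resumable alternative to scp -r; a sketch with the same paths as above:

user@localPC:~$ rsync -avz /home/user/directory/ [email protected]:site/path_from_home/directory #compresses and can resume interrupted transfers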

Commands to check or delete jobs:

user@site:~$ oarstat -u #check whether you have any jobs running on this site and their state
user@site:~$ oardel JOB_ID #delete a job you no longer need by giving its JOB_ID number

Check your storage:

user@site:~$ du -h --max-depth=1 | sort -hr

For more in-depth usage of Grid5000 for deep learning, check Ibada's tutorial.
