@@ -5,7 +5,7 @@ Welcome to the comprehensive guide for using the Euler HPC cluster at ETH Zurich
55## 🚀 Quick Navigation
66
77!!! tip "Getting Started"
8- New to Euler? Start with our [ Complete Guide] ( complete-guide/ ) for detailed instructions on accessing and using the cluster.
8+ New to Euler? Start with our [ Complete Guide] ( complete-guide/ ) for detailed step-by-step instructions on accessing and using the cluster.
99
1010!!! example "Container Workflows"
1111 Learn how to build, deploy, and run [ containerized applications] ( container-workflow/ ) using Docker and Singularity on Euler.
@@ -18,280 +18,156 @@ Welcome to the comprehensive guide for using the Euler HPC cluster at ETH Zurich
1818
1919---
2020
21- ## 📋 Table of Contents
21+ ## 🎯 Quick Start
2222
23- 1 . [ Access Requirements] ( #access-requirements )
24- 2 . [ Quick Start SSH Setup] ( #quick-start-ssh-setup )
25- 3 . [ Storage Overview] ( #storage-overview )
26- 4 . [ Basic SLURM Commands] ( #basic-slurm-commands )
27- 5 . [ Container Workflow Summary] ( #container-workflow-summary )
28- 6 . [ Interactive Sessions] ( #interactive-sessions )
29- 7 . [ Support & Resources] ( #support-resources )
30-
31- ---
32-
33- ## ✅ Access Requirements
34-
35- To get access to the Euler cluster:
36-
37- 1 . ** Fill out the access form** : [ RSL Cluster Access Form] ( https://forms.gle/UsiGkXUmo9YyNHsH8 )
38- 2 . ** RSL members** : Directly message Manthan Patel for faster processing
39- 3 . ** Access approval** : Twice weekly (Tuesdays and Fridays)
40-
41- ** Prerequisites:**
42- - Valid nethz username and password (ETH Zurich credentials)
43- - Terminal access (Linux/macOS or Git Bash on Windows)
44- - Membership in RSL group (es_hutter)
45-
46- ---
47-
48- ## 🔐 Quick Start SSH Setup
49-
50- ### Basic Connection
23+ ### First Time Setup
5124``` bash
25+ # 1. SSH into Euler
5226ssh <your_nethz_username>@euler.ethz.ch
53- ```
54-
55- ### SSH Key Setup (Recommended)
56- ``` bash
57- # Generate SSH key
58- ssh-keygen -t ed25519 -C "[email protected]"
59-
60- # Copy to Euler
61- ssh-copy-id <your_nethz_username>@euler.ethz.ch
62-
63- # Create SSH config (~/.ssh/config)
64- cat >> ~/.ssh/config << EOF
65- Host euler
66- HostName euler.ethz.ch
67- User <your_nethz_username>
68- Compression yes
69- ForwardX11 yes
70- EOF
71-
72- # Now connect simply with:
73- ssh euler
74- ```
7527
76- ### Verify Your Access
77- ``` bash
78- # Check group membership
28+ # 2. Verify RSL group membership
7929my_share_info
8030# Should show: "You are a member of the es_hutter shareholder group"
8131
82- # Create your directories
32+ # 3. Create your directories
8333mkdir -p /cluster/project/rsl/$USER
8434mkdir -p /cluster/work/rsl/$USER
8535```
8636
87- ---
88-
89- ## 💾 Storage Overview
90-
91- | Location | Quota | Files | Purpose | Persistence |
92- | ----------| -------| -------| ---------| -------------|
93- | ** Home** ` /cluster/home/$USER ` | 45 GB | 450K | Code, configs | Permanent |
94- | ** Scratch** ` /cluster/scratch/$USER ` | 2.5 TB | 1M | Datasets, temp files | Auto-deleted after 15 days |
95- | ** Project** ` /cluster/project/rsl/$USER ` | 75 GB | 300K | Conda envs, software | Permanent |
96- | ** Work** ` /cluster/work/rsl/$USER ` | 150 GB | 30K | Results, containers | Permanent |
97- | ** Local** ` $TMPDIR ` | 800 GB | High | Job runtime data | Deleted after job |
98-
99- ### Check Your Usage
37+ ### Submit Your First Job
10038``` bash
101- # Home and Scratch
102- lquota
39+ # Create a simple test script
40+ cat > test_job.sh << 'EOF'
41+ #!/bin/bash
42+ #SBATCH --job-name=test
43+ #SBATCH --time=00:10:00
44+ #SBATCH --mem=4G
45+
46+ echo "Hello from $(hostname)"
47+ echo "Job ID: $SLURM_JOB_ID"
48+ EOF
10349
104- # Project and Work
105- (head -n 5 && grep -w $USER ) < /cluster/work/rsl/.rsl_user_data_usage.txt
106- (head -n 5 && grep -w $USER ) < /cluster/project/rsl/.rsl_user_data_usage.txt
50+ # Submit it
51+ sbatch test_job.sh
10752```
10853
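A minimal sketch of capturing the job ID after submission, so it can be passed on to `squeue` or `scancel`. It assumes `sbatch` prints its usual `Submitted batch job <id>` confirmation line. `extract_job_id` is a hypothetical helper name used for illustration, not part of Slurm or of this guide.

```bash
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's
# "Submitted batch job <id>" confirmation line.
extract_job_id() {
    local line="$1"
    # Keep only the text after the last space, then check it is numeric.
    local id="${line##* }"
    if [[ "$id" =~ ^[0-9]+$ ]]; then
        printf '%s\n' "$id"
    else
        return 1
    fi
}

# Usage on the cluster (requires Slurm, so it is left commented out here):
#   job_id=$(extract_job_id "$(sbatch test_job.sh)")
#   squeue -j "$job_id"
```

As an alternative, `sbatch --parsable` prints the job ID without the surrounding text, which avoids the parsing step.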
10954---
11055
111- ## 🖥️ Basic SLURM Commands
56+ ## 📊 Quick Reference
11257
113- ### Submit a Job
58+ ### Storage Locations
59+ | Location | Quota | Purpose |
60+ | ----------| -------| ---------|
61+ | ` /cluster/home/$USER ` | 45 GB | Code, configs |
62+ | ` /cluster/scratch/$USER ` | 2.5 TB | Datasets (auto-deleted after 15 days) |
63+ | ` /cluster/project/rsl/$USER ` | 75 GB | Conda environments |
64+ | ` /cluster/work/rsl/$USER ` | 150 GB | Results, containers |
65+ | ` $TMPDIR ` | 800 GB | Fast local scratch (per job) |
66+
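Because scratch is purged automatically, it can help to list files that are getting old before they disappear. The sketch below is an illustration only: `list_old_files` is a hypothetical helper, and it assumes the purge is driven by modification time, which the table above does not state.

```bash
#!/bin/bash
# Hypothetical helper: list regular files under a directory that were
# last modified more than N full days ago (find's "-mtime +N" semantics).
list_old_files() {
    local dir="$1" days="$2"
    find "$dir" -type f -mtime +"$days" -print
}

# Example on Euler (commented out; the path only exists on the cluster):
#   list_old_files "/cluster/scratch/$USER" 12
```

Anything this reports that still matters is a candidate for copying to `/cluster/work/rsl/$USER`.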
67+ ### Essential Commands
11468``` bash
115- # Basic job submission
116- sbatch my_job.sh
69+ # Job Management
70+ sbatch script.sh # Submit job
71+ squeue -u $USER # Check your jobs
72+ scancel <job_id> # Cancel job
11773
118- # Interactive session (2 hours, 8 CPUs, 32GB RAM)
119- srun --time=2:00:00 --cpus-per-task=8 --mem=32G --pty bash
74+ # Interactive Sessions
75+ srun --pty bash # Basic session
76+ srun --gpus=1 --pty bash # GPU session
12077
121- # GPU interactive session (4 hours, 1 GPU)
122- srun --time=4:00:00 --gpus=1 --mem=32G --pty bash
78+ # Storage Check
79+ lquota # Check home/scratch usage
12380```
12481
125- ### Monitor Jobs
82+ ### GPU Resources
12683``` bash
127- # Check your jobs
128- squeue -u $USER
129-
130- # Job details
131- scontrol show job <job_id>
84+ # Request specific GPU types
85+ #SBATCH --gpus=1 # Any available GPU
86+ #SBATCH --gpus=nvidia_geforce_rtx_4090:1 # RTX 4090 (24GB)
87+ #SBATCH --gpus=nvidia_a100_80gb_pcie:1 # A100 (80GB)
88+ ```
13289
133- # Cancel a job
134- scancel <job_id>
90+ ---
13591
136- # Job efficiency (after completion)
137- seff <job_id>
138- ```
92+ ## 🚀 Common Workflows
13993
140- ### Sample GPU Job Script
94+ ### GPU Training Job
14195``` bash
14296#!/bin/bash
143- #SBATCH --job-name=gpu-test
144- #SBATCH --output=logs/%j.out
145- #SBATCH --error=logs/%j.err
146- #SBATCH --time=04:00:00
97+ #SBATCH --job-name=training
14798#SBATCH --gpus=1
14899#SBATCH --cpus-per-task=8
149100#SBATCH --mem=32G
150- #SBATCH --tmp=50G
101+ #SBATCH --time=24:00:00
102+ #SBATCH --tmp=100G
151103
152104module load eth_proxy
153-
154- # Your GPU code here
155105python train.py
156106```
157107
158- ---
159-
160- ## 📦 Container Workflow Summary
161-
162- ### 1. Build Docker Image
163- ``` bash
164- # Create Dockerfile
165- cat > Dockerfile << EOF
166- FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
167- RUN apt-get update && apt-get install -y python3-pip
168- RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
169- COPY . /app
170- WORKDIR /app
171- CMD ["python3", "train.py"]
172- EOF
173-
174- # Build image
175- docker build -t my-ml-app:latest .
176- ```
177-
178- ### 2. Convert to Singularity
179- ``` bash
180- # Convert Docker to Singularity
181- apptainer build --sandbox my-ml-app.sif docker-daemon://my-ml-app:latest
182-
183- # Create tar for transfer
184- tar -cf my-ml-app.tar my-ml-app.sif
185- ```
186-
187- ### 3. Transfer to Euler
108+ ### Container Workflow
188109``` bash
189- scp my-ml-app.tar euler:/cluster/work/rsl/$USER/containers/
110+ # 1. Build locally
111+ docker build -t myapp:latest .
112+
113+ # 2. Convert to Singularity
114+ apptainer build --sandbox myapp.sif docker-daemon://myapp:latest
115+ tar -cf myapp.tar myapp.sif
116+
117+ # 3. Transfer & run on Euler
118+ scp myapp.tar euler:/cluster/work/rsl/$USER/
119+ # Then use in job script:
120+ tar -xf /cluster/work/rsl/$USER/myapp.tar -C $TMPDIR
121+ singularity exec --nv $TMPDIR/myapp.sif python app.py
190122```
191123
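The extract-then-run step above can be wrapped so a job fails early when the tarball or the extracted image is missing. This is a sketch under stated assumptions: `stage_container` is a hypothetical helper name, and the tarball is assumed to contain the image at its top level, as produced by the `tar -cf` command in the workflow.

```bash
#!/bin/bash
# Hypothetical helper: unpack a container tarball into a target directory
# and print the path of the extracted image, failing early on any problem.
stage_container() {
    local tarball="$1" dest="$2" image="$3"
    [ -f "$tarball" ] || { echo "missing tarball: $tarball" >&2; return 1; }
    tar -xf "$tarball" -C "$dest" || return 1
    [ -e "$dest/$image" ] || { echo "image not found: $dest/$image" >&2; return 1; }
    printf '%s\n' "$dest/$image"
}

# Usage inside a job script (commented out; needs the cluster paths):
#   sif=$(stage_container "/cluster/work/rsl/$USER/myapp.tar" "$TMPDIR" myapp.sif) || exit 1
#   singularity exec --nv "$sif" python app.py
```

The existence check uses `-e` rather than `-f` because `apptainer build --sandbox` produces a directory, not a single file.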
192- ### 4. Run on Euler
124+ ### Interactive Development
193125``` bash
194- #!/bin/bash
195- #SBATCH --job-name=container-job
196- #SBATCH --gpus=1
197- #SBATCH --tmp=100G
198-
199- # Extract to local scratch (fast!)
200- tar -xf /cluster/work/rsl/$USER/containers/my-ml-app.tar -C $TMPDIR
201-
202- # Run with GPU support
203- singularity exec --nv $TMPDIR/my-ml-app.sif python3 /app/train.py
126+ # JupyterHub: https://jupyter.euler.hpc.ethz.ch
127+ # Or command line:
128+ srun --gpus=1 --mem=32G --time=2:00:00 --pty bash
204129```
205130
206- [ → Full Container Workflow Guide] ( container-workflow/ )
207-
208131---
209132
210- ## 🔧 Interactive Sessions
211-
212- ### JupyterHub Access
213- - ** URL** : [ https://jupyter.euler.hpc.ethz.ch ] ( https://jupyter.euler.hpc.ethz.ch )
214- - ** Login** : Use your nethz credentials
215- - ** Features** : GPU support, VSCode option, pre-installed libraries
216-
217- ### Quick Interactive Commands
218- ``` bash
219- # Basic interactive session
220- srun --pty bash
221-
222- # Development session with GPU
223- srun --gpus=1 --mem=32G --time=2:00:00 --pty bash
224-
225- # High memory session
226- srun --mem=128G --time=1:00:00 --pty bash
133+ ## 📚 Documentation Structure
227134
228- # With local scratch
229- srun --tmp=100G --mem=32G --pty bash
230- ```
135+ - ** [ Complete Guide] ( complete-guide/ ) ** - Comprehensive setup and detailed instructions
136+ - ** [ Container Workflow] ( container-workflow/ ) ** - Full Docker/Singularity workflow with examples
137+ - ** [ Scripts Library] ( scripts/ ) ** - Ready-to-use job scripts and templates
138+ - ** [ Troubleshooting] ( troubleshooting/ ) ** - Solutions to common problems
231139
232140---
233141
234- ## 🐍 Python Environment Setup
142+ ## 🆘 Getting Help
235143
236- ### Miniconda Installation
237- ``` bash
238- # Install in project directory (more space)
239- mkdir -p /cluster/project/rsl/$USER/miniconda3
240- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
241- bash Miniconda3-latest-Linux-x86_64.sh -b -p /cluster/project/rsl/$USER/miniconda3
242- rm Miniconda3-latest-Linux-x86_64.sh
243-
244- # Initialize
245- /cluster/project/rsl/$USER/miniconda3/bin/conda init bash
246- conda config --set auto_activate_base false
247- ```
144+ ### Quick Links
145+ - ** Access Form** : [ RSL Cluster Access] ( https://forms.gle/UsiGkXUmo9YyNHsH8 )
146+ - ** RSL Contact** : Manthan Patel ([email protected])
147+ - ** ETH IT Support** : [ ServiceDesk] ( https://ethz.ch/services/en/it-services/help.html )
148+ - ** Official Docs** : [ Euler Wiki] ( https://scicomp.ethz.ch/wiki/Euler )
248149
249- ### Create Environment
250- ``` bash
251- conda create -n ml_env python=3.10
252- conda activate ml_env
253- conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
254- ```
150+ ### Prerequisites
151+ ✅ Valid nethz account
152+ ✅ RSL group membership (es_hutter)
153+ ✅ Terminal access
154+ ✅ Basic Linux/SLURM knowledge
255155
256156---
257157
258- ## 🛠️ Quick Tips
158+ ## 🎓 Tips for Success
259159
260- !!! success "Best Practices"
261- - ** Use local scratch** (` $TMPDIR ` ) for I/O intensive operations
262- - ** Request only needed resources** to reduce queue time
263- - ** Save work frequently** - interactive sessions can timeout
264- - ** Use job arrays** for parameter sweeps
265- - ** Load ` eth_proxy ` module** for internet access
160+ !!! success "Do's"
161+ - Use ` $TMPDIR ` for I/O intensive operations
162+ - Request only the resources you need
163+ - Use containers for reproducible environments
164+ - Save important results to ` /cluster/work/rsl/$USER `
266165
267- !!! warning "Common Pitfalls"
268- - Don't install conda in home directory (limited inodes)
269- - Don't run jobs on login nodes
166+ !!! warning "Don'ts"
167+ - Don't run computations on login nodes
270168 - Don't exceed storage quotas
271- - Remember scratch data is auto-deleted after 15 days
272-
273- ---
274-
275- ## 📞 Support & Resources
276-
277- ### Getting Help
278- - ** Cluster Issues** : ETH IT ServiceDesk
279- - ** RSL Access** : Contact Manthan Patel ([email protected])
280- - ** Guide Issues** : [ GitHub Issues] ( https://github.com/leggedrobotics/euler-cluster-guide/issues )
281-
282- ### Useful Links
283- - [ Official Euler Documentation] ( https://scicomp.ethz.ch/wiki/Euler )
284- - [ Getting Started with GPUs] ( https://scicomp.ethz.ch/wiki/Getting_started_with_GPUs )
285- - [ JupyterHub Access] ( https://jupyter.euler.hpc.ethz.ch )
286- - [ RSL Lab Homepage] ( https://rsl.ethz.ch )
287-
288- ### Tested Configuration
289- | Component | Version |
290- | -----------| ---------|
291- | ** Docker** | 24.0.7 |
292- | ** Apptainer** | 1.2.5 |
293- | ** Cluster** | Euler (ETH Zurich) |
294- | ** Group** | es_hutter (RSL) |
169+ - Don't leave interactive sessions idle
170+ - Don't store data only in scratch (auto-deleted!)
295171
296172---
297173