Parallel Hyperparameter Search #84

Open

wants to merge 20 commits into main
Conversation

@lowlypalace (Contributor) commented on Dec 16, 2023

PR Description

This PR makes hyperparameter search more scalable by running the training loop for each agent in parallel. This is especially useful when running the search on a cluster (e.g. Slurm).

  • Assign workers to devices such as GPU / CPU. The list of devices can now be provided as an argument (see the sketch after this list).
  • Fix an issue where we were not setting the number of threads for each process. This made the CPU parallelization even slower than a sequential run. Now this is fixed. Link to the issue.
  • Add a num_workers parameter to define the size of the process pool. I'm still not sure about this one, as we need to wait for all of the workers from the same iteration before we can compute the hypervolume, so maybe we can just spawn the same number of processes as num_seeds.
  • There was a bug in one of the initializers of the Envelope and PCN algorithms, where we were passing device as an id, making it essentially always default to the same device.
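
A rough sketch of how the first three points fit together (illustrative only, not the code in this PR; `_train_one_seed`, `run_iteration`, and the round-robin device assignment are simplified assumptions):

```python
import multiprocessing as mp

import torch


def _train_one_seed(seed: int, device: str, num_threads: int) -> float:
    """Hypothetical per-worker entry point; the real training loop lives in the sweep script."""
    # Without this, every worker competes for all cores and CPU runs get slower than sequential.
    torch.set_num_threads(num_threads)
    torch.manual_seed(seed)
    dev = torch.device(device if device == "cpu" or torch.cuda.is_available() else "cpu")
    # Stand-in for agent.train(): a small tensor op on the assigned device.
    x = torch.randn(256, 256, device=dev)
    return float((x @ x.T).mean())


def run_iteration(seeds, devices, num_workers, threads_per_worker=1):
    """Round-robin the seeds over the provided devices and wait for the whole iteration."""
    jobs = [(s, devices[i % len(devices)], threads_per_worker) for i, s in enumerate(seeds)]
    with mp.get_context("spawn").Pool(processes=num_workers) as pool:
        # All workers of one iteration must finish before the hypervolume can be computed.
        return pool.starmap(_train_one_seed, jobs)


if __name__ == "__main__":
    print(run_iteration(seeds=range(4), devices=["cpu"], num_workers=4))
```

Passing an explicit device string (or `torch.device`) into each algorithm's constructor, rather than an integer id, is what keeps the workers from all landing on the same device (last bullet above).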

TODO:

  • True GPU parallelization of policy evaluation / model update step.
  • Add examples of runs with different configurations

Example Configs on a Slurm Cluster

Using 4 GPUs + 4 workers

```bash
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes

#SBATCH --cpus-per-task 4 # number of processes
#SBATCH -G 4

python experiments/hyperparameter_search/launch_sweep.py \
  --algo envelope \
  --env-id minecart-v0 \
  --sweep-count 100 \
  --seed 10 \
  --num-seeds 4 \
  --num-workers 4 \
  --devices cuda:0 cuda:1 cuda:2 cuda:3
```

Using 4 CPUs + 4 workers

```bash
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes

#SBATCH --cpus-per-task 4 # number of processes

python experiments/hyperparameter_search/launch_sweep.py \
  --algo envelope \
  --env-id minecart-v0 \
  --sweep-count 100 \
  --seed 10 \
  --num-seeds 4 \
  --num-workers 4
```

Each worker will use `auto`, and each algorithm instance will then default to `cpu`, as CUDA is not available.
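
A minimal sketch of what that `auto` fallback amounts to (the helper name here is an assumption, not necessarily how the codebase resolves it):

```python
import torch


def resolve_device(device: str = "auto") -> torch.device:
    """Map "auto" to cuda when available, otherwise fall back to cpu."""
    if device == "auto":
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(device)
```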

Example Runs on a Slurm Cluster


| Workers | CPUs | GPUs | CPU Usage | GPU Usage | Sweeps |
|---------|------|------|-----------|----------------------|--------|
| 4 | 4 | 0 | 94.88% | N/A | 18 |
| 4 | 1 | 0 | 95.55% | N/A | 15 |
| 1 | 1 | 0 | 25.03% | N/A | 15 |
| 4 | 4 | 1 | 18.78% | 99% | 5 |
| 4 | 4 | 4 | 31.07% | 9% / 11% / 12% / 10% | 5 |
| 4 | 1 | 4 | 98.85% | 4% / 5% / 5% / 5% | 13 |
  • Workers corresponds to `num_workers`, CPUs corresponds to `--cpus-per-task`, GPUs corresponds to `-G`.
  • Number of seeds set to 4 (i.e. training 4 agents).
  • GPU Usage measured through the `srun -s --jobid <job-id> --pty nvidia-smi` command while the job was running.
  • CPU Usage measured via the `seff` command after the job finished.
  • CPU Config: Intel Broadwell or Skylake processors.
  • GPU Config: Tesla V100.
  • Each run lasted 10 hours.

@lowlypalace (Contributor, Author) commented on Dec 16, 2023

Any idea why black is complaining? I've run the linter on my end.
