Issues with accelerate and deepspeed training #331

Open
swang99 opened this issue Apr 6, 2023 · 4 comments

swang99 commented Apr 6, 2023

I've been having issues trying to distribute training across multiple GPUs. Even after following pull request #316, nvidia-smi still shows the entire load on one GPU, whether I launch with deepspeed or accelerate.

Here are the commands I tried to train the actor models:
deepspeed artifacts/main.py artifacts/config/config.yaml --type RL
accelerate launch artifacts/main.py artifacts/config/config.yaml --type RL
Everything else I kept at the default configuration.
Any ideas? Thank you.
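
For reference, assuming a two-GPU machine, the world size can also be forced explicitly at launch (not verified to change anything for this repo, just a sanity check on the launcher side):

deepspeed --num_gpus=2 artifacts/main.py artifacts/config/config.yaml --type RL
accelerate launch --multi_gpu --num_processes 2 artifacts/main.py artifacts/config/config.yaml --type RL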

EikeKohl commented Apr 6, 2023

Hi @swang99,
Did you set deepspeed_enable to True in the config.yaml? This is required for DeepSpeed to be used.
If you did, can you please share the config.yaml as well as the DeepSpeed config you used?
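
For reference, the keys in question live in the actor_config section (the path below is just the default from the example config, not necessarily yours):

  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"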

swang99 commented Apr 8, 2023

Hi @EikeKohl. Yes, here is my config.yaml:

---
trainer_config:
  # learning rates
  actor_lr: 0.000005
  critic_lr: 0.000009
  # PPO Hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.02
  # coefficient for the discounted rewards
  gamma_discounted: 1 
  # path to examples to be sampled (training dataset) see rlhf_dataset.json
  examples_path: "./datasets/rlhf_training_data.json"
  # number of episodes and generation performed for each episode
  # in the train() method
  num_episodes: 100
  max_timesteps: 32
  # number of timesteps after which the learn() method is called 
  # (to update the weights)
  update_timesteps: 32
  # number of examples sampled at each timestep
  num_examples: 1
  # batch and epochs for the training
  batch_size: 1
  epochs: 1
  # number of episodes after which the checkpoints are updated in RL training
  checkpoint_steps: 1000
  # here specify the name of the actor_rl checkpoint from which to resume
  # during actor RL training. If null, load the last one.
  checkpoint_name: null

actor_config:
  model: "facebook/opt-1.3b"
  model_folder: "./models"
  tokenizer_path: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  # freeze model embeddings during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  # only for llama
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion); it depends on
  # the model used.
  max_sequence_length: 2048
  # max tokens generated by the actor (completion only)
  max_tokens: 2048
  # minimum number of tokens generated by the actor
  min_tokens: 100
  # additional prompt tokens to be used for the template or as a safety margin
  additonal_prompt_tokens: 20
  # temperature for the actor
  temperature: 0.1
  batch_size: 2
  # number of iterations between prints
  iteration_per_print: 1
  lr: 0.000009
  epochs: 1
  # number of backpropagation steps after which the checkpoints are saved
  checkpoint_steps: 5000
  # number of checkpoints to keep while removing the older ones
  # (keeps the memory consumption of checkpoints reasonable)
  n_checkpoints_to_keep: 5
  # here specify the name of the actor checkpoint from which to resume
  # during actor training. If null, load the last one.
  checkpoint_name: null
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False
  # use_peft - the parameters of PEFT can be modified in the peft_config.yaml
  peft_enable: False
  peft_config_path: "./artifacts/config/peft_config.yaml"

reward_config:
  # models to choose from are gpt2-large, bart-base, longformer-base-4096
  # more can simply be added in the reward.py __init__()
  model: "facebook/opt-125m"
  model_folder: "./models"
  # hidden size of the additional ffw head to produce the scores
  model_head_hidden_size: 2048
  max_sequence_length: 2048
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 8
  epochs: 1
  iteration_per_print: 1
  # steps after which the checkpoints are saved
  checkpoint_steps: 10000
  # here specify the name of the reward checkpoint from which to resume
  # during reward training. If null, load the last one.
  checkpoint_name: null
  lr: 0.000009
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False

critic_config:
  # models to choose from are gpt2-large, bart-base, longformer-base-4096
  # more can simply be added in the reward.py __init__()
  model: "facebook/opt-125m"
  # hidden size of the additional ffw head to produce the scores
  model_head_hidden_size: 2048
  max_sequence_length: 2048
  model_folder: "./models"
  # here specify the name of the critic checkpoint from which to resume
  # during critic training. If null, load the last one.
  checkpoint_name: null

And the ds_config.json:

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": false,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "sub_group_size": 1e12,
    "elastic_checkpoint": true,
    "stage3_gather_16bit_weights_on_model_save": true,
    "ignore_unused_parameters": true,
    "round_robin_gradients": true
  }
}

@PierpaoloSorbellini
Collaborator

Hi @swang99, thanks for opening the issue.
I think I found the problem.
If you look at PR #306, there is an updated config.yaml (with added fields for distributed training with RL).
It seems that yours is not updated; could you check that?
Thanks again!
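
As a rough illustration only (these keys are modeled on the deepspeed_enable / accelerate_enable settings already shown in actor_config and reward_config above, not copied from PR #306), the updated trainer_config gains toggles along these lines:

trainer_config:
  # ... existing settings ...
  # deepspeed settings for the RL training loop (illustrative)
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings for the RL training loop (illustrative)
  accelerate_enable: False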

@EikeKohl

@swang99 Was your issue resolved with the latest repo state? If not, could you please share your nvidia-smi output? You could also try ZeRO optimization stage 3 (https://www.deepspeed.ai/tutorials/zero/#zero-overview).
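
As a minimal sketch (values illustrative and untested; note that the offload_param block and the stage3_* options in the config posted above only take effect at stage 3), switching the ds_config.json to ZeRO stage 3 could look like:

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}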
