loading checkpoint doesn't work for resuming training and test #247

qiuyuchen14 opened this issue Jul 20, 2023 · 10 comments

qiuyuchen14 commented Jul 20, 2023

Hi! I'm trying the humanoid SAC example, but loading a checkpoint doesn't seem to work for either testing or resuming training. Here is what I did:

  1. Training: python train.py task=HumanoidSAC train=HumanoidSAC
  2. Test: python train.py task=HumanoidSAC train=HumanoidSAC test=True checkpoint=runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth
  3. Resume training: python train.py task=HumanoidSAC train=HumanoidSAC checkpoint=runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth

The training itself works fine and the reward went up to >5000, but if I test or resume from the saved checkpoint, it doesn't seem to initialize with the saved weights properly and the reward is only around 40. I did a quick check and it does go through restore and set_full_state_weights, so I'm not sure where the problem might be. One thing I did change was here: weights['step'] -> weights['steps'], due to a KeyError.
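For reference, here is roughly how I inspected the saved checkpoint while debugging; the exact key names are algorithm- and version-dependent, so treat this as illustrative only:

import torch

# rl-games checkpoints are plain torch .pth files; listing the top-level keys
# shows what actually got saved (network weights, optimizer state, step/epoch
# counters). Key names differ between algorithms and rl-games versions.
checkpoint = torch.load("runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth", map_location="cpu")
print(list(checkpoint.keys()))

# Peek inside any nested state dicts to confirm the weights look trained
# rather than freshly initialized.
for key, value in checkpoint.items():
    if isinstance(value, dict):
        print(key, list(value.keys())[:5])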

I'm using rl-games 1.6.0 and IsaacGym 1.0rc4.

Thank you!!

ViktorM (Collaborator) commented Jul 21, 2023

@qiuyuchen14 thank you for reporting the issue. I'll take a look.

qiuyuchen14 (Author) commented

Thanks, @ViktorM! Any updates?

ViktorM (Collaborator) commented Jul 28, 2023

I reproduced the issue, also found another one, and will push a fix tomorrow.

sashwat-mahalingam commented

Hi,

I've designed an environment in IsaacGym and am currently training it with the A2C continuous (PPO) implementation. I am running into a similar error when trying to resume training from a checkpoint or when using a checkpoint for evaluation: my training rewards have converged around ~2000, while my evaluation rewards are ~200. As a more informative metric, the task terminates if the agent performs any unrecoverable behavior such as falling down, or when the episode length hits 700. My average episode length before termination is ~690-700 in training at convergence, but only ~50-100 in testing. Nothing in my environment code changes the environment between training and testing, and my training reward has stayed consistently around 2000 over 3e7 environment steps, so I'm confident the issue is not overfitting. I train with the following configuration:

params:
  seed: ${...seed}

  algo:
    name: a2c_continuous

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None

        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [64, 32, 32]
      activation: elu
      
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: ${if:${...checkpoint},True,False} # flag which sets whether to load the checkpoint
  load_path: ${...checkpoint} # path to the checkpoint to load

  config:
    name: ${resolve_default:H1Reach,${....experiment}}
    full_experiment_name: ${.name}
    env_name: rlgpu
    multi_gpu: ${....multi_gpu}
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    reward_shaper:
      scale_value: 0.1
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 2e-3
    lr_schedule: adaptive
    max_epochs: ${resolve_default:12000,${....max_iterations}}
    grad_norm: 1.0
    entropy_coef: 0.00
    truncate_grads: True
    horizon_length: 32
    bounds_loss_coef: 0.0001
    num_actors: ${....task.env.numEnvs}   
    schedule_type: standard
    kl_threshold: 0.008
    score_to_win: 1000000
    max_frames: 10_000_000_000
    save_best_after: 100
    save_frequency: 1000
    print_stats: True
    e_clip: 0.1
    minibatch_size: 32768
    mini_epochs: 4
    critic_coef: 4.0
    clip_value: True
    seq_len: 32

    player:
      deterministic: True
      games_num: 8000

I have tried setting player.deterministic to both True and False, but it does not change the issue for me. Since this thread cites similar inconsistencies between training and testing from checkpoints for other agents, I wanted to follow up with the above question. I am using rl-games 1.6.1, and have tried both the best checkpoint and the checkpoints saved every save_frequency iterations that are labeled with the highest rewards. Thanks!
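Since normalize_input and normalize_value are both enabled above, one additional check I plan to do is whether the running-mean-std buffers are actually present in the saved model state. A rough sketch; the checkpoint path and key names here are assumptions and vary across rl-games versions:

import torch

# Rough sanity check: with normalize_input / normalize_value enabled, the saved
# model state should contain running-mean-std buffers. If they were missing,
# evaluation would see unnormalized observations and the policy would look broken.
# (The checkpoint path and key names below are assumptions for illustration.)
ckpt = torch.load("runs/H1Reach/nn/H1Reach.pth", map_location="cpu")
model_state = ckpt.get("model", {})
print([k for k in model_state if "mean_std" in k])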

Denys88 (Owner) commented Nov 19, 2023

@sashwat-mahalingam could you try the regular Ant? Does it work?

sashwat-mahalingam commented Nov 19, 2023

The regular Ant has the same issue. Below is the reward curve over epochs for two runs. The magenta one is an Ant policy I trained with PPO from scratch. The purple one is me training the Ant policy again from a checkpoint where the reward was 8039 (but at epoch 0 it achieves far below that reward in spite of having resumed from the checkpoint).

[Reward-vs-epoch plot: magenta = Ant trained from scratch, purple = Ant resumed from the 8039-reward checkpoint]

However, the testing reward does fine, achieving 9015.

Denys88 (Owner) commented Nov 19, 2023

@sashwat-mahalingam this could be related to the way I do reporting. When you restart, the first couple of total-reward reports come from the ants that failed early, because you don't get any results from the good ants until step 1000. I'll double-check it anyway.
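A toy illustration of that reporting effect (not rl-games code, just the idea): right after a restart, only the episodes that have already terminated, mostly the short failed ones, have contributed to the reward window, so the first reported averages are low even though the policy is fine.

from collections import deque

# Toy illustration (not rl-games code): a windowed mean over finished-episode
# rewards right after a restart. Only short, early-failing episodes have
# finished yet, so the first reports are low even with a good policy.
window = deque(maxlen=100)

for r in [40, 35, 50, 45]:          # early failures finish first
    window.append(r)
print("first report:", sum(window) / len(window))

for r in [8000, 8100, 7900, 8050]:  # long successful episodes finish later
    window.append(r)
print("later report:", sum(window) / len(window))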

sashwat-mahalingam commented Nov 20, 2023

I see; this makes sense. Do you have any pointers as to what else I could be missing that would cause my environment to show this discrepancy between training and testing results (since for the Ant environment this is clearly not the case)? While I make sure not to apply randomizations in my custom environment, I am wondering if there are other configurations I am missing that change the testing environment drastically from the training one when using the rl-games training setup. I've already discussed trying stochastic vs. deterministic players above, but beyond that, from my read of the rl-games codebase, I am not sure what other configurations might cause this environment to behave so differently in training vs. testing.

Here is an example of how my yaml looks for the actual environment itself:

# used to create the object
name: H1Reach

physics_engine: ${..physics_engine}

# if given, will override the device setting in gym.

# set to True if you use camera sensors in the environment
enableCameraSensors: True

env:
  numEnvs: ${resolve_default:8192,${...num_envs}}
  envSpacing: 4.0

  clipActions: 1.0

  plane:
    staticFriction: 2.0
    dynamicFriction: 2.0
    restitution: 0.0

  asset:
    assetRoot: "../../assets"
    assetFileName: "urdf/h1_description/h1_model.urdf"
  
  stiffness: {'left_hip_yaw_joint': 32.0, 'left_hip_roll_joint': 32.0, 'left_hip_pitch_joint': 32.0, 'left_knee_joint': 16.0, 'left_ankle_joint': 4.0, 'right_hip_yaw_joint': 32.0, 'right_hip_roll_joint': 32.0, 'right_hip_pitch_joint': 32.0, 'right_knee_joint': 16.0, 'right_ankle_joint': 4.0, 'torso_joint': 0.0, 'left_shoulder_pitch_joint': 14.0, 'left_shoulder_roll_joint': 14.0, 'left_shoulder_yaw_joint': 12.0, 'left_elbow_joint': 4.0, 'right_shoulder_pitch_joint': 14.0, 'right_shoulder_roll_joint': 14.0, 'right_shoulder_yaw_joint': 12.0, 'right_elbow_joint': 4.0}
  damping: {'left_hip_yaw_joint': 5.0, 'left_hip_roll_joint': 9.3, 'left_hip_pitch_joint': 9.0, 'left_knee_joint': 2.5, 'left_ankle_joint': 0.24, 'right_hip_yaw_joint': 5.0, 'right_hip_roll_joint': 9.3, 'right_hip_pitch_joint': 9.0, 'right_knee_joint': 2.5, 'right_ankle_joint': 0.24, 'torso_joint': 0.0, 'left_shoulder_pitch_joint': 2.6, 'left_shoulder_roll_joint': 2.5, 'left_shoulder_yaw_joint': 1.2, 'left_elbow_joint': 0.48, 'right_shoulder_pitch_joint': 2.6, 'right_shoulder_roll_joint': 2.5, 'right_shoulder_yaw_joint': 1.2, 'right_elbow_joint': 0.48}

  epLen: 700

  fixRobot: False

  aliveReward: 5.0
  offAxisScale: 1.0
  lowerHeightScale: 1.0
  velScale: 0.0
  proximityScale: 0.1

sim:
  dt: 0.0166 # 1/60 s
  substeps: 12
  up_axis: "z"
  use_gpu_pipeline: ${eq:${...pipeline},"gpu"}
  gravity: [0.0, 0.0, -9.81]
  physx:
    num_threads: ${....num_threads}
    solver_type: ${....solver_type}
    use_gpu: True # ${contains:"cuda",${....sim_device}} # set to False to run on CPU
    num_position_iterations: 4
    num_velocity_iterations: 0
    contact_offset: 0.002
    rest_offset: 0.0001
    bounce_threshold_velocity: 0.2
    max_depenetration_velocity: 1000.0
    default_buffer_size_multiplier: 5.0
    max_gpu_contact_pairs: 1048576 # 1024*1024
    num_subscenes: ${....num_subscenes}
    contact_collection: 0 # 0: CC_NEVER (don't collect contact info), 1: CC_LAST_SUBSTEP (collect only contacts on last substep), 2: CC_ALL_SUBSTEPS (broken - do not use!)

task:
  randomize: False

sashwat-mahalingam commented

I think the issue is that the PhysX solver is only deterministic under the same sequence of operations and seed. It would not behave exactly the same way when I load the checkpoint under a fixed seed to resume or evaluate as it did when training the model from scratch. I had assumed that the distribution of physics steps would be the same given just the seed, but it seems this is not guaranteed either, so I can't expect to achieve the same expected performance during reruns. Since I do not use domain randomization, which the other tasks all seem to do, could adding it fix the problem?

StefanoBraghetto commented

To resume, you have to pass the checkpoint path into runner.run like this:

from rl_games.torch_runner import Runner
from rl_games.common.algo_observer import IsaacAlgoObserver  # import path may vary between rl-games versions

# agent_cfg is the rl-games params dict; resume_path points to the saved .pth checkpoint
agent_cfg["params"]["load_checkpoint"] = True
agent_cfg["params"]["load_path"] = resume_path

runner = Runner(IsaacAlgoObserver())
runner.load(agent_cfg)
runner.run({"train": True, "play": False, "sigma": None, "checkpoint": resume_path})

Hope this helps!
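For the evaluation ("test") case, presumably the same setup works with the train/play flags flipped so the player runs instead of the trainer; a rough sketch, assuming the same agent_cfg and resume_path as in the snippet above:

from rl_games.torch_runner import Runner
from rl_games.common.algo_observer import IsaacAlgoObserver  # import path may vary between rl-games versions

# Reuse the same agent_cfg and resume_path as defined in the snippet above;
# the player restores the checkpoint passed through the run arguments.
runner = Runner(IsaacAlgoObserver())
runner.load(agent_cfg)
runner.run({"train": False, "play": True, "sigma": None, "checkpoint": resume_path})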
