Dataloader #83
# Spawn training to GPUs
distribute.spawn_train(train, cfg, world_size, device)
torch.multiprocessing.set_start_method("spawn")
verbose, world_size = distribute.setup(local_rank)
Running it on a single node produces the following error:
Error executing job with overrides: []
Traceback (most recent call last):
File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 817, in main
verbose, world_size = distribute.setup(local_rank)
UnboundLocalError: local variable 'local_rank' referenced before assignment
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
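The traceback suggests that `local_rank` is only assigned on some code paths before `distribute.setup(local_rank)` is called. One possible way to guarantee it is always bound is sketched below. This is a minimal sketch, assuming the job is started by a launcher such as `torchrun` that exports `LOCAL_RANK`; the helper name `resolve_local_rank` is hypothetical and not part of this repository.

```python
import os


def resolve_local_rank(env=None, default=0):
    """Return the local rank of this process.

    Assumption: a distributed launcher (e.g. torchrun) exports LOCAL_RANK.
    On a plain single-node run the variable is absent, so fall back to
    `default` instead of leaving the name unbound.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", default))


# Always bound, whether or not a distributed launcher is in use.
local_rank = resolve_local_rank()
```

With something like this in place, `local_rank` has a value on both the single-node and the multi-node path.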
Now, it works with my custom data on a single node and multi-nodes. I haven't checked with
In resume mode, it seems that simulator.module.load() should be changed to simulator.load(), although the previous version of parallel GNS used to work with simulator.module.load() in the distributed setting.
Error executing job with overrides: []
Traceback (most recent call last):
File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 843, in main
train(local_rank, cfg, world_size, device, verbose, use_dist)
File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 475, in train
simulator.module.load(cfg.model.path + cfg.model.file)
File "/work2/08264/baagee/frontera/venvs/venv-frontera-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LearnedSimulator' object has no attribute 'module'
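The error above is what one would expect when the simulator is not wrapped: `torch.nn.parallel.DistributedDataParallel` exposes the wrapped model as `.module`, while a bare model has no such attribute. A minimal sketch of a call that works in both cases follows. The helper `unwrap_model` and the two stand-in classes are hypothetical and only illustrate the pattern; they are not the repository's code.

```python
def unwrap_model(model):
    """Return the underlying model, wrapped or not.

    A DistributedDataParallel wrapper stores the real model in `.module`;
    a bare model has no such attribute, so return it unchanged.
    """
    return getattr(model, "module", model)


class PlainSimulator:
    """Stand-in for an unwrapped simulator (illustration only)."""

    def load(self, path):
        return f"loaded {path}"


class Wrapped:
    """Stand-in for a DDP-style wrapper (illustration only)."""

    def __init__(self, module):
        self.module = module


plain = PlainSimulator()
wrapped = Wrapped(plain)

# The same call works for both the wrapped and the bare model.
unwrap_model(plain).load("model.pt")
unwrap_model(wrapped).load("model.pt")
```

In `train.py` this would correspond to something like `unwrap_model(simulator).load(cfg.model.path + cfg.model.file)`, assuming `simulator` is either the bare model or its distributed wrapper.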
@skye-glitch: Could you address this, please?
Please check the updated code.
Also, there is a minor error about When
Should have been fixed
Another minor issue is that
Fixed. Thanks!
@yjchoi1 Could you check to see if everything is good to merge?
After addressing the above comment about the single-node issue, everything seems good to merge.
Describe the PR
Supports both npz and hdf5 data formats
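As a rough illustration of what supporting both formats can look like, the sketch below picks a reader from the file extension. It is not the implementation in this PR: the function names `detect_format` and `open_dataset` and the accepted extensions (`.npz`, `.h5`, `.hdf5`) are assumptions made for the example.

```python
from pathlib import Path


def detect_format(path):
    """Map a dataset path to a format name based on its extension.

    Assumption: npz files end in .npz and hdf5 files in .h5 or .hdf5.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".npz":
        return "npz"
    if suffix in {".h5", ".hdf5"}:
        return "hdf5"
    raise ValueError(f"Unsupported data format: {suffix!r}")


def open_dataset(path):
    """Open a dataset file with the reader matching its format."""
    fmt = detect_format(path)
    if fmt == "npz":
        import numpy as np  # imported lazily so only the needed reader is required
        return np.load(path, allow_pickle=True)
    import h5py  # imported lazily for the same reason
    return h5py.File(path, "r")
```

Dispatching on the extension keeps the calling code the same for either format, for example `data = open_dataset(dataset_path)` where `dataset_path` is a hypothetical variable.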
Related Issues/PRs
#82
Additional Context
Will remove data_loader.py once we have merged multinode training to train.py