Dataloader #83

kks32 · 2024-06-27T13:58:38Z

Describe the PR
Supports both the npz and hdf5 data formats.
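As a rough illustration of what supporting two formats can involve, the sketch below picks a reader from the file suffix. The names `detect_format` and `open_dataset` and the suffix table are assumptions made for this example, not the code added in this PR.

```python
from pathlib import Path

# Hypothetical suffix table; the PR's loader may recognise other suffixes.
_FORMATS = {".npz": "npz", ".h5": "hdf5", ".hdf5": "hdf5"}


def detect_format(path):
    """Map a file path to 'npz' or 'hdf5' based on its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"Unsupported data format: {suffix!r}")
    return _FORMATS[suffix]


def open_dataset(path):
    """Open `path` with the matching reader (third-party imports are lazy)."""
    if detect_format(path) == "npz":
        import numpy as np
        return np.load(path, allow_pickle=True)
    import h5py
    return h5py.File(path, "r")
```

Keeping the format check separate from the reader means an unsupported file fails early with a clear message, before numpy or h5py is imported.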

Related Issues/PRs
#82

Additional Context
Will remove data_loader.py once multinode training has been merged into train.py.

yjchoi1 · 2024-07-02T05:00:10Z

gns/train.py

-            # Spawn training to GPUs
-            distribute.spawn_train(train, cfg, world_size, device)
+            torch.multiprocessing.set_start_method("spawn")
+            verbose, world_size = distribute.setup(local_rank)


Running this on a single node produces the following error:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 817, in main
    verbose, world_size = distribute.setup(local_rank)
UnboundLocalError: local variable 'local_rank' referenced before assignment

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
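The traceback suggests that local_rank is only assigned on some code path before distribute.setup(local_rank) is reached. One possible way to make sure it is always bound is sketched below, assuming the launcher exports LOCAL_RANK (torchrun does) and that 0 is an acceptable default for a plain single-process run. The helper name `get_local_rank` is hypothetical and not part of train.py.

```python
import os


def get_local_rank(env=None):
    """Return the local rank, defaulting to 0 when no launcher has set it."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", 0))


# Bound on every code path, whether or not a distributed launcher is used.
local_rank = get_local_rank()
```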


yjchoi1 · 2024-07-10T05:01:40Z

It now works with my custom data on both a single node and multiple nodes. I haven't tested it with .h5 data, though.

yjchoi1 · 2024-07-10T05:18:44Z

gns/train.py

In resume mode, simulator.module.load() apparently needs to be changed to simulator.load(), even though the previous version of parallel GNS worked with simulator.module.load() in the distributed setting.

Error executing job with overrides: []
Traceback (most recent call last):
  File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 843, in main
    train(local_rank, cfg, world_size, device, verbose, use_dist)
  File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 475, in train
    simulator.module.load(cfg.model.path + cfg.model.file)
  File "/work2/08264/baagee/frontera/venvs/venv-frontera-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LearnedSimulator' object has no attribute 'module'
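The error indicates that the simulator is not wrapped at the point where load() is called. DistributedDataParallel exposes the wrapped model as `.module`, while a bare model has no such attribute. A minimal sketch of a call site that tolerates both cases follows. The `unwrap` helper and the two stand-in classes are hypothetical and only illustrate the pattern.

```python
def unwrap(model):
    """Return the underlying model, whether or not it is wrapped."""
    # Falls back to the model itself when there is no `.module` attribute.
    return getattr(model, "module", model)


class FakeSimulator:
    """Stand-in for LearnedSimulator (assumption: it has a load(path) method)."""

    def load(self, path):
        self.loaded_from = path


class FakeWrapper:
    """Mimics the `.module` attribute of DistributedDataParallel."""

    def __init__(self, module):
        self.module = module


plain = FakeSimulator()
wrapped = FakeWrapper(FakeSimulator())

# The same call works for both the bare and the wrapped simulator.
unwrap(plain).load("model-100.pt")
unwrap(wrapped).load("model-100.pt")
```

This pattern would misfire on a model that defines its own unrelated attribute named `module`, so it is only a convenience, not a substitute for knowing when the model gets wrapped.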

@skye-glitch: Could you address this, please?

Please check the updated code.

yjchoi1 · 2024-07-10T05:25:51Z

Also, there is a minor error in the tqdm count.

When resuming, tqdm does not show the correct step count. For example, if the model is saved at step 100/500000 in epoch 0 and we resume from model-100.pt, the tqdm count starts at 0/500000 rather than 100/500000, even though the code correctly loads the model at step 100.
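One way to get the bar to start at the resumed step is tqdm's `initial` argument, which sets the starting value of the counter. The sketch below assumes the resumed step is already known when the bar is created. The helper names `progress_kwargs` and `make_bar` are hypothetical and not taken from train.py.

```python
def progress_kwargs(resume_step, total_steps):
    """Build tqdm keyword arguments so the bar starts at the resumed step."""
    if not 0 <= resume_step <= total_steps:
        raise ValueError("resume_step must lie between 0 and total_steps")
    return {"total": total_steps, "initial": resume_step}


def make_bar(resume_step, total_steps):
    """Create the progress bar (tqdm is third-party, so import it lazily)."""
    from tqdm import tqdm
    return tqdm(**progress_kwargs(resume_step, total_steps))
```

With these helpers, `make_bar(100, 500000)` would start the display at 100/500000, and each later `update(1)` advances it from there.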

skye-glitch · 2024-07-11T16:41:58Z

> Also, there is a minor error in the tqdm count.
>
> When resuming, tqdm does not show the correct step count. For example, if the model is saved at step 100/500000 in epoch 0 and we resume from model-100.pt, the tqdm count starts at 0/500000 rather than 100/500000, even though the code correctly loads the model at step 100.

This should be fixed now.

yjchoi1 · 2024-07-17T14:12:49Z

Another minor issue: gns/args.py still has the n_gpus argument, although we no longer use it.

skye-glitch · 2024-07-17T14:47:36Z

> Another minor issue: gns/args.py still has the n_gpus argument, although we no longer use it.

Fixed. Thanks!

kks32 · 2024-07-17T23:14:25Z

@yjchoi1 Could you check whether everything is good to merge?

yjchoi1 · 2024-07-17T23:27:01Z

Once the comment above about the single-node issue is addressed, everything looks good to merge.

kks32 added 30 commits June 25, 2024 07:57

Initial implementation with hydra

dfda24f

Test CI for config yaml

50c63c2

Testing CI for hydra

c93c335

Try pip requirements.txt

13c5204

Try pip instead of conda for docker

7ebded9

Docker build on GitHub

85ff713

Copy requirements.txt file before installing on Docker container

e607f98

Copy requirements.txt

9913d92

Copy requirements.txt

111b428

Trying with user flag

3606790

Trying Python 3.11

b6cea3e

Test CircleCI with ghcr container image

f9cb06d

GitHub Actions workflow to test training GNS

5d8ce23

Updated dockerfile with paths and env

b230de8

Modify workflow to run training

20fa700

Add at least one epoch to run when nsteps is fewer than 1 epoch steps

7764f9d

Train GNS action

1e612df

Test without docker pull on CircleCI

1f0b8cd

Specify path and branches

8c0ddad

Fix path to GNS sample output

4a7e5c4

Worflow runs on Github and remove conda on circleci

c5bcc44

No black check

dc27691

Only try to build container if specific files have changed

5329b78

Fix resume training and README

a97cd5e

Reduce number of steps to 100 for testing

170037b

Refactor constants to data

40b4919

Remove on PR

c6b092d

Add config to tensorboard writer

14f3b9e

Particle data loader

5bab43b

Add tests for data loader

8a3a540

Sikan Li and others added 11 commits June 28, 2024 13:20

n_gpus

79cd9f4

update Dockerfile

748b9f4

reformat

a74ced0

Remove unused dataloader

768702a

GPU cocntainer

0a28073

Remove blank lines in GPU container

b1081b9

Update README with container image

3202caf

Add GitHub badge

291ce9a

WIP: Refactor train

50fe084

Fix validation dataloader

27919d1

Prepare data function

985a68a

yjchoi1 reviewed Jul 2, 2024


Sikan Li added 6 commits July 2, 2024 08:47

check scripts

0e79137

Merge branch 'dataloader' of https://github.com/geoelements/gns into …

7748b64

…dataloader

remove extra file

1a8546c

use python to launch

7f31242

black

2bcd131

update test

86b4dbb

yjchoi1 reviewed Jul 10, 2024


fix loading simulator and resuming from middle of epoch

f72b172

remove n_gpus

8f4e92c

kks32 merged commit 325fc8d into v2 Jul 24, 2024
1 check passed

kks32 deleted the dataloader branch July 24, 2024 11:42
