It did not converge on the ogbn-arxiv dataset. #10

Open
JiaLonghao1997 opened this issue May 28, 2024 · 1 comment

Comments

@JiaLonghao1997

When training the SAGE teacher on ogbn-arxiv with codebook_size=32768 and batch_size=512, we hit the following CUDA out-of-memory error.

20240528-09:53:39: output_dir: /public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0
20240528-09:53:39: Total 169343 nodes.
20240528-09:53:39: Total 2501829 edges.
20240528-09:53:39: Load data with max_memory_allocated: 0.0000Gb | max_memory_cached: 0.0000Gb
20240528-09:53:39: conf: {'device': device(type='cuda', index=0), 'seed': 0, 'log_level': 20, 'console_log': True, 'output_path': 'outputs', 'num_exp': 1, 'exp_setting': 'tran', 'eval_interval': 1, 'save_results': False, 'dataset': 'ogbn-arxiv', 'data_path': './data', 'labelrate_train': 20, 'labelrate_val': 30, 'split_idx': 0, 'codebook_size': 32768, 'lamb_node': 0.001, 'lamb_edge': 0.03, 'model_config_path': '/public/home/jialh/metaHiC/models/01VQGraph/obgn_arxiv.conf.yaml', 'teacher': 'SAGE', 'num_layers': 2, 'hidden_dim': 256, 'dropout_ratio': 0.2, 'norm_type': 'batch', 'batch_size': 512, 'fan_out': '5,10', 'num_workers': 0, 'learning_rate': 0.01, 'weight_decay': 0, 'max_epoch': 100, 'patience': 50, 'feature_noise': 0, 'split_rate': 0.2, 'compute_min_cut': False, 'feature_aug_k': 0, 'output_dir': PosixPath('/public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0'), 'feat_dim': 128, 'label_dim': 40, 'model_name': 'SAGE'}
Traceback (most recent call last):
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 244, in <module>
    main()
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 227, in main
    score = run(args)
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 132, in run
    out, score_val, score_test, h_list, dist, codebook = run_transductive(
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_and_eval.py", line 295, in run_transductive
    out, loss_train, score_train,  _, _, _ = evaluate(
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_and_eval.py", line 148, in evaluate
    h_list, logits, _ , dist, codebook = model.inference(data, feats)
  File "/public/home/jialh/metaHiC/models/01VQGraph/models.py", line 519, in inference
    return self.encoder.inference(data, feats)
  File "/public/home/jialh/metaHiC/models/01VQGraph/models.py", line 243, in inference
    dist_all = torch.zeros(feats.shape[0], self.codebook_size, device=device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.67 GiB (GPU 0; 10.91 GiB total capacity; 101.55 MiB already allocated; 10.13 GiB free; 104.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

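To avoid the out-of-memory error, we reduced batch_size to 64 and codebook_size to 8192 (num_layers=2 and fan_out=5,10 are unchanged). Training then runs, but it does not converge: the loss grows from 2.4847 at epoch 1 to 1211.5399 at epoch 40, while s_val stays between roughly 0.33 and 0.46.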
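
The 20.67 GiB in the traceback is consistent with a dense tensor of shape (num_nodes, codebook_size), which is what the `torch.zeros(feats.shape[0], self.codebook_size, device=device)` call at models.py line 243 allocates. A quick back-of-the-envelope check, assuming float32 elements (4 bytes each, PyTorch's default dtype); `dense_gib` is a hypothetical helper name used only for this illustration:

```python
# Rough size of a dense (num_nodes x codebook_size) tensor.
# Assumption: float32 elements (4 bytes each), the PyTorch default dtype.
NUM_NODES = 169343  # "Total 169343 nodes." in the log


def dense_gib(rows: int, cols: int, bytes_per_elem: int = 4) -> float:
    """Return the size in GiB of a dense rows x cols tensor."""
    return rows * cols * bytes_per_elem / 2**30


for codebook_size in (32768, 8192):
    size = dense_gib(NUM_NODES, codebook_size)
    print(f"codebook_size={codebook_size}: {size:.2f} GiB")
```

Under that assumption, this one tensor needs about 20.67 GiB with 32768 codes and about 5.17 GiB with 8192 codes, so only the smaller codebook fits on the 10.91 GiB card.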

20240528-10:00:35: output_dir: /public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0
20240528-10:00:35: Total 169343 nodes.
20240528-10:00:35: Total 2501829 edges.
20240528-10:00:35: Load data with max_memory_allocated: 0.0000Gb | max_memory_cached: 0.0000Gb
20240528-10:00:35: conf: {'device': device(type='cuda', index=0), 'seed': 0, 'log_level': 20, 'console_log': True, 'output_path': 'outputs', 'num_exp': 1, 'exp_setting': 'tran', 'eval_interval': 1, 'save_results': False, 'dataset': 'ogbn-arxiv', 'data_path': './data', 'labelrate_train': 20, 'labelrate_val': 30, 'split_idx': 0, 'codebook_size': 8192, 'lamb_node': 0.001, 'lamb_edge': 0.03, 'model_config_path': '/public/home/jialh/metaHiC/models/01VQGraph/obgn_arxiv.conf.yaml', 'teacher': 'SAGE', 'num_layers': 2, 'hidden_dim': 256, 'dropout_ratio': 0.2, 'norm_type': 'batch', 'batch_size': 64, 'fan_out': '5,10', 'num_workers': 0, 'learning_rate': 0.01, 'weight_decay': 0, 'max_epoch': 100, 'patience': 50, 'feature_noise': 0, 'split_rate': 0.2, 'compute_min_cut': False, 'feature_aug_k': 0, 'output_dir': PosixPath('/public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0'), 'feat_dim': 128, 'label_dim': 40, 'model_name': 'SAGE'}
20240528-10:01:21: out.size(): torch.Size([169343, 40])
20240528-10:01:21: Ep   1 | max_memory_allocated: 8.4489Gb | loss: 2.4847 | s_train: 0.3918 | s_val: 0.4154 | s_test: 0.3865
20240528-10:02:03: out.size(): torch.Size([169343, 40])
20240528-10:02:03: Ep   2 | max_memory_allocated: 8.4784Gb | loss: 5.3593 | s_train: 0.3868 | s_val: 0.4248 | s_test: 0.4366
20240528-10:02:44: out.size(): torch.Size([169343, 40])
20240528-10:02:44: Ep   3 | max_memory_allocated: 8.4784Gb | loss: 9.8887 | s_train: 0.4054 | s_val: 0.4238 | s_test: 0.4109
20240528-10:03:26: out.size(): torch.Size([169343, 40])
20240528-10:03:26: Ep   4 | max_memory_allocated: 8.4784Gb | loss: 14.7743 | s_train: 0.4291 | s_val: 0.4472 | s_test: 0.4399
20240528-10:04:08: out.size(): torch.Size([169343, 40])
20240528-10:04:08: Ep   5 | max_memory_allocated: 8.4788Gb | loss: 19.6258 | s_train: 0.4261 | s_val: 0.4569 | s_test: 0.4425
20240528-10:04:50: out.size(): torch.Size([169343, 40])
20240528-10:04:50: Ep   6 | max_memory_allocated: 8.4788Gb | loss: 24.9095 | s_train: 0.4253 | s_val: 0.4276 | s_test: 0.4159
20240528-10:05:32: out.size(): torch.Size([169343, 40])
20240528-10:05:32: Ep   7 | max_memory_allocated: 8.4788Gb | loss: 30.3602 | s_train: 0.4224 | s_val: 0.4353 | s_test: 0.4223
20240528-10:06:13: out.size(): torch.Size([169343, 40])
20240528-10:06:13: Ep   8 | max_memory_allocated: 8.4788Gb | loss: 35.7189 | s_train: 0.4145 | s_val: 0.4387 | s_test: 0.4437
20240528-10:06:55: out.size(): torch.Size([169343, 40])
20240528-10:06:55: Ep   9 | max_memory_allocated: 8.4788Gb | loss: 41.0788 | s_train: 0.4183 | s_val: 0.4456 | s_test: 0.4395
20240528-10:07:37: out.size(): torch.Size([169343, 40])
20240528-10:07:37: Ep  10 | max_memory_allocated: 8.4788Gb | loss: 46.3140 | s_train: 0.4137 | s_val: 0.4240 | s_test: 0.4146
20240528-10:08:18: out.size(): torch.Size([169343, 40])
20240528-10:08:18: Ep  11 | max_memory_allocated: 8.4788Gb | loss: 51.6967 | s_train: 0.3891 | s_val: 0.3980 | s_test: 0.3784
20240528-10:09:00: out.size(): torch.Size([169343, 40])
20240528-10:09:00: Ep  12 | max_memory_allocated: 8.4788Gb | loss: 57.0279 | s_train: 0.4157 | s_val: 0.4148 | s_test: 0.4079
20240528-10:09:42: out.size(): torch.Size([169343, 40])
20240528-10:09:42: Ep  13 | max_memory_allocated: 8.4788Gb | loss: 62.2806 | s_train: 0.4147 | s_val: 0.4572 | s_test: 0.4607
20240528-10:10:24: out.size(): torch.Size([169343, 40])
20240528-10:10:24: Ep  14 | max_memory_allocated: 8.4788Gb | loss: 67.5761 | s_train: 0.4291 | s_val: 0.4422 | s_test: 0.4360
20240528-10:11:06: out.size(): torch.Size([169343, 40])
20240528-10:11:06: Ep  15 | max_memory_allocated: 8.4788Gb | loss: 73.1767 | s_train: 0.4107 | s_val: 0.4147 | s_test: 0.3931
20240528-10:11:48: out.size(): torch.Size([169343, 40])
20240528-10:11:48: Ep  16 | max_memory_allocated: 8.4788Gb | loss: 79.3345 | s_train: 0.4260 | s_val: 0.4328 | s_test: 0.4248
20240528-10:12:30: out.size(): torch.Size([169343, 40])
20240528-10:12:30: Ep  17 | max_memory_allocated: 8.4788Gb | loss: 86.1251 | s_train: 0.4152 | s_val: 0.4019 | s_test: 0.4046
20240528-10:13:12: out.size(): torch.Size([169343, 40])
20240528-10:13:12: Ep  18 | max_memory_allocated: 8.4788Gb | loss: 92.6365 | s_train: 0.4112 | s_val: 0.4315 | s_test: 0.4274
20240528-10:13:54: out.size(): torch.Size([169343, 40])
20240528-10:13:54: Ep  19 | max_memory_allocated: 8.4788Gb | loss: 99.6484 | s_train: 0.4001 | s_val: 0.3916 | s_test: 0.3596
20240528-10:14:36: out.size(): torch.Size([169343, 40])
20240528-10:14:36: Ep  20 | max_memory_allocated: 8.4788Gb | loss: 106.8252 | s_train: 0.3850 | s_val: 0.3665 | s_test: 0.3562
20240528-10:15:18: out.size(): torch.Size([169343, 40])
20240528-10:15:18: Ep  21 | max_memory_allocated: 8.4788Gb | loss: 115.3929 | s_train: 0.3980 | s_val: 0.3825 | s_test: 0.3524
20240528-10:16:00: out.size(): torch.Size([169343, 40])
20240528-10:16:00: Ep  22 | max_memory_allocated: 8.4788Gb | loss: 124.5834 | s_train: 0.4036 | s_val: 0.3981 | s_test: 0.4074
20240528-10:16:42: out.size(): torch.Size([169343, 40])
20240528-10:16:42: Ep  23 | max_memory_allocated: 8.4788Gb | loss: 135.1810 | s_train: 0.4047 | s_val: 0.4172 | s_test: 0.4079
20240528-10:17:24: out.size(): torch.Size([169343, 40])
20240528-10:17:24: Ep  24 | max_memory_allocated: 8.4788Gb | loss: 147.3654 | s_train: 0.4042 | s_val: 0.4236 | s_test: 0.4325
20240528-10:18:05: out.size(): torch.Size([169343, 40])
20240528-10:18:05: Ep  25 | max_memory_allocated: 8.4788Gb | loss: 160.5010 | s_train: 0.4014 | s_val: 0.4227 | s_test: 0.3989
20240528-10:18:47: out.size(): torch.Size([169343, 40])
20240528-10:18:47: Ep  26 | max_memory_allocated: 8.4788Gb | loss: 175.0141 | s_train: 0.3878 | s_val: 0.3512 | s_test: 0.3322
20240528-10:19:30: out.size(): torch.Size([169343, 40])
20240528-10:19:30: Ep  27 | max_memory_allocated: 8.4788Gb | loss: 192.3251 | s_train: 0.3712 | s_val: 0.4209 | s_test: 0.4286
20240528-10:20:11: out.size(): torch.Size([169343, 40])
20240528-10:20:11: Ep  28 | max_memory_allocated: 8.4788Gb | loss: 212.6606 | s_train: 0.3490 | s_val: 0.3544 | s_test: 0.3758
20240528-10:20:54: out.size(): torch.Size([169343, 40])
20240528-10:20:54: Ep  29 | max_memory_allocated: 8.4788Gb | loss: 237.1151 | s_train: 0.3840 | s_val: 0.3876 | s_test: 0.3846
20240528-10:21:36: out.size(): torch.Size([169343, 40])
20240528-10:21:36: Ep  30 | max_memory_allocated: 8.4788Gb | loss: 265.4982 | s_train: 0.3654 | s_val: 0.3808 | s_test: 0.3745
20240528-10:22:18: out.size(): torch.Size([169343, 40])
20240528-10:22:18: Ep  31 | max_memory_allocated: 8.4788Gb | loss: 299.6145 | s_train: 0.3932 | s_val: 0.4149 | s_test: 0.4051
20240528-10:23:00: out.size(): torch.Size([169343, 40])
20240528-10:23:00: Ep  32 | max_memory_allocated: 8.4788Gb | loss: 343.6816 | s_train: 0.3656 | s_val: 0.3318 | s_test: 0.3026
20240528-10:23:42: out.size(): torch.Size([169343, 40])
20240528-10:23:42: Ep  33 | max_memory_allocated: 8.4788Gb | loss: 396.8048 | s_train: 0.3733 | s_val: 0.3808 | s_test: 0.3688
20240528-10:24:23: out.size(): torch.Size([169343, 40])
20240528-10:24:23: Ep  34 | max_memory_allocated: 8.4788Gb | loss: 461.6037 | s_train: 0.3869 | s_val: 0.4067 | s_test: 0.3970
20240528-10:25:05: out.size(): torch.Size([169343, 40])
20240528-10:25:05: Ep  35 | max_memory_allocated: 8.4788Gb | loss: 541.3137 | s_train: 0.3899 | s_val: 0.4165 | s_test: 0.4132
20240528-10:25:47: out.size(): torch.Size([169343, 40])
20240528-10:25:47: Ep  36 | max_memory_allocated: 8.4788Gb | loss: 642.1798 | s_train: 0.4000 | s_val: 0.4126 | s_test: 0.3850
20240528-10:26:29: out.size(): torch.Size([169343, 40])
20240528-10:26:29: Ep  37 | max_memory_allocated: 8.4788Gb | loss: 762.5302 | s_train: 0.4127 | s_val: 0.4166 | s_test: 0.3943
20240528-10:27:11: out.size(): torch.Size([169343, 40])
20240528-10:27:11: Ep  38 | max_memory_allocated: 8.4788Gb | loss: 897.7273 | s_train: 0.3879 | s_val: 0.3960 | s_test: 0.3696
20240528-10:27:53: out.size(): torch.Size([169343, 40])
20240528-10:27:53: Ep  39 | max_memory_allocated: 8.4788Gb | loss: 1049.9378 | s_train: 0.3979 | s_val: 0.4002 | s_test: 0.3817
20240528-10:28:36: out.size(): torch.Size([169343, 40])
20240528-10:28:36: Ep  40 | max_memory_allocated: 8.4788Gb | loss: 1211.5399 | s_train: 0.4186 | s_val: 0.4271 | s_test: 0.4065
20240528-10:29:18: out.size(): torch.Size([169343, 40])
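
To make the divergence explicit, the epoch lines above can be summarized with a short script. This is a minimal sketch that assumes the log format shown in this issue; `parse_epochs` and `summarize` are hypothetical helper names, and only three of the epoch lines are embedded as sample input.

```python
import re

# Matches epoch lines in the format printed by the training log above.
EPOCH_RE = re.compile(
    r"Ep\s+(?P<epoch>\d+)\s*\|.*?loss:\s*(?P<loss>[\d.]+)"
    r".*?s_val:\s*(?P<s_val>[\d.]+)"
)


def parse_epochs(log_text):
    """Return (epoch, loss, s_val) tuples for every epoch line in log_text."""
    rows = []
    for line in log_text.splitlines():
        m = EPOCH_RE.search(line)
        if m:
            rows.append((int(m["epoch"]), float(m["loss"]), float(m["s_val"])))
    return rows


def summarize(rows):
    """Count loss increases between consecutive rows and find the best s_val."""
    increases = sum(1 for prev, cur in zip(rows, rows[1:]) if cur[1] > prev[1])
    best = max(rows, key=lambda r: r[2])
    return {
        "epochs": len(rows),
        "loss_increases": increases,
        "first_loss": rows[0][1],
        "last_loss": rows[-1][1],
        "best_val_epoch": best[0],
        "best_val": best[2],
    }


# Three lines copied from the log above; paste the full log to analyze all epochs.
SAMPLE = """\
20240528-10:01:21: Ep   1 | max_memory_allocated: 8.4489Gb | loss: 2.4847 | s_train: 0.3918 | s_val: 0.4154 | s_test: 0.3865
20240528-10:09:42: Ep  13 | max_memory_allocated: 8.4788Gb | loss: 62.2806 | s_train: 0.4147 | s_val: 0.4572 | s_test: 0.4607
20240528-10:28:36: Ep  40 | max_memory_allocated: 8.4788Gb | loss: 1211.5399 | s_train: 0.4186 | s_val: 0.4271 | s_test: 0.4065
"""

print(summarize(parse_epochs(SAMPLE)))
```

In the full 40-epoch log the loss increases at every one of the 39 epoch-to-epoch steps, and the best s_val (0.4572) occurs at epoch 13.
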
@hxu105

hxu105 commented Dec 31, 2024

I am seeing the same issue with graph tokenization training on ogbn-arxiv: the model does not converge.
