When running on ogbn-arxiv, we encountered the following error:
```
20240528-09:53:39: output_dir: /public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0
20240528-09:53:39: Total 169343 nodes.
20240528-09:53:39: Total 2501829 edges.
20240528-09:53:39: Load data with max_memory_allocated: 0.0000Gb | max_memory_cached: 0.0000Gb
20240528-09:53:39: conf: {'device': device(type='cuda', index=0), 'seed': 0, 'log_level': 20, 'console_log': True, 'output_path': 'outputs', 'num_exp': 1, 'exp_setting': 'tran', 'eval_interval': 1, 'save_results': False, 'dataset': 'ogbn-arxiv', 'data_path': './data', 'labelrate_train': 20, 'labelrate_val': 30, 'split_idx': 0, 'codebook_size': 32768, 'lamb_node': 0.001, 'lamb_edge': 0.03, 'model_config_path': '/public/home/jialh/metaHiC/models/01VQGraph/obgn_arxiv.conf.yaml', 'teacher': 'SAGE', 'num_layers': 2, 'hidden_dim': 256, 'dropout_ratio': 0.2, 'norm_type': 'batch', 'batch_size': 512, 'fan_out': '5,10', 'num_workers': 0, 'learning_rate': 0.01, 'weight_decay': 0, 'max_epoch': 100, 'patience': 50, 'feature_noise': 0, 'split_rate': 0.2, 'compute_min_cut': False, 'feature_aug_k': 0, 'output_dir': PosixPath('/public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0'), 'feat_dim': 128, 'label_dim': 40, 'model_name': 'SAGE'}
Traceback (most recent call last):
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 244, in <module>
    main()
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 227, in main
    score = run(args)
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_teacher.py", line 132, in run
    out, score_val, score_test, h_list, dist, codebook = run_transductive(
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_and_eval.py", line 295, in run_transductive
    out, loss_train, score_train, _, _, _ = evaluate(
  File "/public/home/jialh/metaHiC/models/01VQGraph/train_and_eval.py", line 148, in evaluate
    h_list, logits, _ , dist, codebook = model.inference(data, feats)
  File "/public/home/jialh/metaHiC/models/01VQGraph/models.py", line 519, in inference
    return self.encoder.inference(data, feats)
  File "/public/home/jialh/metaHiC/models/01VQGraph/models.py", line 243, in inference
    dist_all = torch.zeros(feats.shape[0], self.codebook_size, device=device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.67 GiB (GPU 0; 10.91 GiB total capacity; 101.55 MiB already allocated; 10.13 GiB free; 104.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
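The reported allocation matches the failing line: `dist_all` is a dense `[num_nodes, codebook_size]` float32 tensor, i.e. 169343 × 32768 × 4 bytes ≈ 20.67 GiB, far beyond the 10.91 GiB GPU. A minimal sketch of that arithmetic, plus one possible workaround of computing node-to-codebook distances in chunks so the full matrix never materializes (NumPy used purely for illustration; `nearest_code_chunked` and the chunk size are hypothetical, not part of the VQGraph codebase):

```python
import numpy as np

# Size of the failing allocation: a dense [num_nodes, codebook_size] float32 matrix.
num_nodes, codebook_size = 169343, 32768
print(f"{num_nodes * codebook_size * 4 / 2**30:.2f} GiB")  # → 20.67 GiB, as in the error

# Hypothetical workaround sketch: if only the nearest-code assignments are needed,
# process the nodes in chunks so only a [chunk, codebook_size] slice exists at a time.
def nearest_code_chunked(feats, codebook, chunk=4096):
    sq_codes = (codebook ** 2).sum(1)               # [K], precomputed once
    ids = np.empty(feats.shape[0], dtype=np.int64)
    for start in range(0, feats.shape[0], chunk):
        block = feats[start:start + chunk]          # [chunk, dim]
        # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2ab, shape [chunk, K]
        d = (block ** 2).sum(1, keepdims=True) + sq_codes - 2 * block @ codebook.T
        ids[start:start + chunk] = d.argmin(1)
    return ids
```

The same chunking pattern applies directly to the `torch` tensors in `models.py`; peak memory then scales with `chunk * codebook_size` instead of `num_nodes * codebook_size`.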
To avoid the CUDA out-of-memory error, we set batch_size=64, codebook_size=8192, num_layers=2, and fan_out=5,10, but training then failed to converge.
```
20240528-10:00:35: output_dir: /public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0
20240528-10:00:35: Total 169343 nodes.
20240528-10:00:35: Total 2501829 edges.
20240528-10:00:35: Load data with max_memory_allocated: 0.0000Gb | max_memory_cached: 0.0000Gb
20240528-10:00:35: conf: {'device': device(type='cuda', index=0), 'seed': 0, 'log_level': 20, 'console_log': True, 'output_path': 'outputs', 'num_exp': 1, 'exp_setting': 'tran', 'eval_interval': 1, 'save_results': False, 'dataset': 'ogbn-arxiv', 'data_path': './data', 'labelrate_train': 20, 'labelrate_val': 30, 'split_idx': 0, 'codebook_size': 8192, 'lamb_node': 0.001, 'lamb_edge': 0.03, 'model_config_path': '/public/home/jialh/metaHiC/models/01VQGraph/obgn_arxiv.conf.yaml', 'teacher': 'SAGE', 'num_layers': 2, 'hidden_dim': 256, 'dropout_ratio': 0.2, 'norm_type': 'batch', 'batch_size': 64, 'fan_out': '5,10', 'num_workers': 0, 'learning_rate': 0.01, 'weight_decay': 0, 'max_epoch': 100, 'patience': 50, 'feature_noise': 0, 'split_rate': 0.2, 'compute_min_cut': False, 'feature_aug_k': 0, 'output_dir': PosixPath('/public/home/jialh/metaHiC/models/01VQGraph/outputs/transductive/ogbn-arxiv/SAGE/seed_0'), 'feat_dim': 128, 'label_dim': 40, 'model_name': 'SAGE'}
20240528-10:01:21: out.size(): torch.Size([169343, 40])
20240528-10:01:21: Ep 1 | max_memory_allocated: 8.4489Gb | loss: 2.4847 | s_train: 0.3918 | s_val: 0.4154 | s_test: 0.3865
20240528-10:02:03: out.size(): torch.Size([169343, 40])
20240528-10:02:03: Ep 2 | max_memory_allocated: 8.4784Gb | loss: 5.3593 | s_train: 0.3868 | s_val: 0.4248 | s_test: 0.4366
20240528-10:02:44: out.size(): torch.Size([169343, 40])
20240528-10:02:44: Ep 3 | max_memory_allocated: 8.4784Gb | loss: 9.8887 | s_train: 0.4054 | s_val: 0.4238 | s_test: 0.4109
20240528-10:03:26: out.size(): torch.Size([169343, 40])
20240528-10:03:26: Ep 4 | max_memory_allocated: 8.4784Gb | loss: 14.7743 | s_train: 0.4291 | s_val: 0.4472 | s_test: 0.4399
20240528-10:04:08: out.size(): torch.Size([169343, 40])
20240528-10:04:08: Ep 5 | max_memory_allocated: 8.4788Gb | loss: 19.6258 | s_train: 0.4261 | s_val: 0.4569 | s_test: 0.4425
20240528-10:04:50: out.size(): torch.Size([169343, 40])
20240528-10:04:50: Ep 6 | max_memory_allocated: 8.4788Gb | loss: 24.9095 | s_train: 0.4253 | s_val: 0.4276 | s_test: 0.4159
20240528-10:05:32: out.size(): torch.Size([169343, 40])
20240528-10:05:32: Ep 7 | max_memory_allocated: 8.4788Gb | loss: 30.3602 | s_train: 0.4224 | s_val: 0.4353 | s_test: 0.4223
20240528-10:06:13: out.size(): torch.Size([169343, 40])
20240528-10:06:13: Ep 8 | max_memory_allocated: 8.4788Gb | loss: 35.7189 | s_train: 0.4145 | s_val: 0.4387 | s_test: 0.4437
20240528-10:06:55: out.size(): torch.Size([169343, 40])
20240528-10:06:55: Ep 9 | max_memory_allocated: 8.4788Gb | loss: 41.0788 | s_train: 0.4183 | s_val: 0.4456 | s_test: 0.4395
20240528-10:07:37: out.size(): torch.Size([169343, 40])
20240528-10:07:37: Ep 10 | max_memory_allocated: 8.4788Gb | loss: 46.3140 | s_train: 0.4137 | s_val: 0.4240 | s_test: 0.4146
20240528-10:08:18: out.size(): torch.Size([169343, 40])
20240528-10:08:18: Ep 11 | max_memory_allocated: 8.4788Gb | loss: 51.6967 | s_train: 0.3891 | s_val: 0.3980 | s_test: 0.3784
20240528-10:09:00: out.size(): torch.Size([169343, 40])
20240528-10:09:00: Ep 12 | max_memory_allocated: 8.4788Gb | loss: 57.0279 | s_train: 0.4157 | s_val: 0.4148 | s_test: 0.4079
20240528-10:09:42: out.size(): torch.Size([169343, 40])
20240528-10:09:42: Ep 13 | max_memory_allocated: 8.4788Gb | loss: 62.2806 | s_train: 0.4147 | s_val: 0.4572 | s_test: 0.4607
20240528-10:10:24: out.size(): torch.Size([169343, 40])
20240528-10:10:24: Ep 14 | max_memory_allocated: 8.4788Gb | loss: 67.5761 | s_train: 0.4291 | s_val: 0.4422 | s_test: 0.4360
20240528-10:11:06: out.size(): torch.Size([169343, 40])
20240528-10:11:06: Ep 15 | max_memory_allocated: 8.4788Gb | loss: 73.1767 | s_train: 0.4107 | s_val: 0.4147 | s_test: 0.3931
20240528-10:11:48: out.size(): torch.Size([169343, 40])
20240528-10:11:48: Ep 16 | max_memory_allocated: 8.4788Gb | loss: 79.3345 | s_train: 0.4260 | s_val: 0.4328 | s_test: 0.4248
20240528-10:12:30: out.size(): torch.Size([169343, 40])
20240528-10:12:30: Ep 17 | max_memory_allocated: 8.4788Gb | loss: 86.1251 | s_train: 0.4152 | s_val: 0.4019 | s_test: 0.4046
20240528-10:13:12: out.size(): torch.Size([169343, 40])
20240528-10:13:12: Ep 18 | max_memory_allocated: 8.4788Gb | loss: 92.6365 | s_train: 0.4112 | s_val: 0.4315 | s_test: 0.4274
20240528-10:13:54: out.size(): torch.Size([169343, 40])
20240528-10:13:54: Ep 19 | max_memory_allocated: 8.4788Gb | loss: 99.6484 | s_train: 0.4001 | s_val: 0.3916 | s_test: 0.3596
20240528-10:14:36: out.size(): torch.Size([169343, 40])
20240528-10:14:36: Ep 20 | max_memory_allocated: 8.4788Gb | loss: 106.8252 | s_train: 0.3850 | s_val: 0.3665 | s_test: 0.3562
20240528-10:15:18: out.size(): torch.Size([169343, 40])
20240528-10:15:18: Ep 21 | max_memory_allocated: 8.4788Gb | loss: 115.3929 | s_train: 0.3980 | s_val: 0.3825 | s_test: 0.3524
20240528-10:16:00: out.size(): torch.Size([169343, 40])
20240528-10:16:00: Ep 22 | max_memory_allocated: 8.4788Gb | loss: 124.5834 | s_train: 0.4036 | s_val: 0.3981 | s_test: 0.4074
20240528-10:16:42: out.size(): torch.Size([169343, 40])
20240528-10:16:42: Ep 23 | max_memory_allocated: 8.4788Gb | loss: 135.1810 | s_train: 0.4047 | s_val: 0.4172 | s_test: 0.4079
20240528-10:17:24: out.size(): torch.Size([169343, 40])
20240528-10:17:24: Ep 24 | max_memory_allocated: 8.4788Gb | loss: 147.3654 | s_train: 0.4042 | s_val: 0.4236 | s_test: 0.4325
20240528-10:18:05: out.size(): torch.Size([169343, 40])
20240528-10:18:05: Ep 25 | max_memory_allocated: 8.4788Gb | loss: 160.5010 | s_train: 0.4014 | s_val: 0.4227 | s_test: 0.3989
20240528-10:18:47: out.size(): torch.Size([169343, 40])
20240528-10:18:47: Ep 26 | max_memory_allocated: 8.4788Gb | loss: 175.0141 | s_train: 0.3878 | s_val: 0.3512 | s_test: 0.3322
20240528-10:19:30: out.size(): torch.Size([169343, 40])
20240528-10:19:30: Ep 27 | max_memory_allocated: 8.4788Gb | loss: 192.3251 | s_train: 0.3712 | s_val: 0.4209 | s_test: 0.4286
20240528-10:20:11: out.size(): torch.Size([169343, 40])
20240528-10:20:11: Ep 28 | max_memory_allocated: 8.4788Gb | loss: 212.6606 | s_train: 0.3490 | s_val: 0.3544 | s_test: 0.3758
20240528-10:20:54: out.size(): torch.Size([169343, 40])
20240528-10:20:54: Ep 29 | max_memory_allocated: 8.4788Gb | loss: 237.1151 | s_train: 0.3840 | s_val: 0.3876 | s_test: 0.3846
20240528-10:21:36: out.size(): torch.Size([169343, 40])
20240528-10:21:36: Ep 30 | max_memory_allocated: 8.4788Gb | loss: 265.4982 | s_train: 0.3654 | s_val: 0.3808 | s_test: 0.3745
20240528-10:22:18: out.size(): torch.Size([169343, 40])
20240528-10:22:18: Ep 31 | max_memory_allocated: 8.4788Gb | loss: 299.6145 | s_train: 0.3932 | s_val: 0.4149 | s_test: 0.4051
20240528-10:23:00: out.size(): torch.Size([169343, 40])
20240528-10:23:00: Ep 32 | max_memory_allocated: 8.4788Gb | loss: 343.6816 | s_train: 0.3656 | s_val: 0.3318 | s_test: 0.3026
20240528-10:23:42: out.size(): torch.Size([169343, 40])
20240528-10:23:42: Ep 33 | max_memory_allocated: 8.4788Gb | loss: 396.8048 | s_train: 0.3733 | s_val: 0.3808 | s_test: 0.3688
20240528-10:24:23: out.size(): torch.Size([169343, 40])
20240528-10:24:23: Ep 34 | max_memory_allocated: 8.4788Gb | loss: 461.6037 | s_train: 0.3869 | s_val: 0.4067 | s_test: 0.3970
20240528-10:25:05: out.size(): torch.Size([169343, 40])
20240528-10:25:05: Ep 35 | max_memory_allocated: 8.4788Gb | loss: 541.3137 | s_train: 0.3899 | s_val: 0.4165 | s_test: 0.4132
20240528-10:25:47: out.size(): torch.Size([169343, 40])
20240528-10:25:47: Ep 36 | max_memory_allocated: 8.4788Gb | loss: 642.1798 | s_train: 0.4000 | s_val: 0.4126 | s_test: 0.3850
20240528-10:26:29: out.size(): torch.Size([169343, 40])
20240528-10:26:29: Ep 37 | max_memory_allocated: 8.4788Gb | loss: 762.5302 | s_train: 0.4127 | s_val: 0.4166 | s_test: 0.3943
20240528-10:27:11: out.size(): torch.Size([169343, 40])
20240528-10:27:11: Ep 38 | max_memory_allocated: 8.4788Gb | loss: 897.7273 | s_train: 0.3879 | s_val: 0.3960 | s_test: 0.3696
20240528-10:27:53: out.size(): torch.Size([169343, 40])
20240528-10:27:53: Ep 39 | max_memory_allocated: 8.4788Gb | loss: 1049.9378 | s_train: 0.3979 | s_val: 0.4002 | s_test: 0.3817
20240528-10:28:36: out.size(): torch.Size([169343, 40])
20240528-10:28:36: Ep 40 | max_memory_allocated: 8.4788Gb | loss: 1211.5399 | s_train: 0.4186 | s_val: 0.4271 | s_test: 0.4065
20240528-10:29:18: out.size(): torch.Size([169343, 40])
```
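Note that the loss in this log is not merely failing to decrease: it grows monotonically from 2.4847 at epoch 1 to 1211.5399 at epoch 40, which suggests outright divergence (possibly from the VQ-related loss terms) rather than slow convergence. A quick sanity check over values copied from the log:

```python
# Per-epoch training losses taken verbatim from the log (epochs 1-10 and 36-40).
losses = [2.4847, 5.3593, 9.8887, 14.7743, 19.6258,
          24.9095, 30.3602, 35.7189, 41.0788, 46.3140,
          642.1798, 762.5302, 897.7273, 1049.9378, 1211.5399]

# Strictly increasing throughout -> the optimization is diverging.
diverging = all(b > a for a, b in zip(losses, losses[1:]))
print(diverging)  # → True
```

Divergence of this kind is typically addressed by lowering the learning rate or down-weighting the offending loss terms, though we have not verified which fix applies here.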
I'm having the same issue with graph tokenization training on ogbn-arxiv: the model cannot converge.