The Synthetic Model single-GPU example always runs out of memory (OOM), even on an A100 machine with batch_size=1
python main.py --model small --optimizer sgd --batch_size 1
root@2437d34894a8:/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models# python main.py --model small --optimizer sgd --batch_size 1
2023-08-10 06:59:19.339639: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-10 06:59:27.281514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1638] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38111 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0, compute capability: 8.0
I0810 06:59:27.387234 139766424715712 synthetic_models.py:144] 107 embedding tables created.
I0810 06:59:27.409688 139766424715712 synthetic_models.py:83] Generated 116 categorical inputs for 107 embedding tables
2023-08-10 06:59:30.842389: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1780] (One-time warning): Not using XLA:CPU for cluster.
If you want XLA:CPU, do one of the following:
- set the TF_XLA_FLAGS to include "--tf_xla_cpu_global_jit", or
- set cpu_global_jit to true on this session's OptimizerOptions, or
- use experimental_jit_scope, or
- use tf.function(jit_compile=True).
To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a
proper command-line flag, not via TF_XLA_FLAGS).
/usr/local/lib/python3.10/dist-packages/keras/initializers/initializers.py:120: UserWarning: The initializer RandomUniform is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
warnings.warn(
2023-08-10 06:59:59.695493: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:655] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-08-10 07:00:13.955161: W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 26.29GiB (rounded to 28224000000)requested by op Fill
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2023-08-10 07:00:13.955235: I tensorflow/tsl/framework/bfc_allocator.cc:1039] BFCAllocator dump for GPU_0_bfc
2023-08-10 07:00:13.955251: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (256): Total Chunks: 244, Chunks in use: 244. 61.0KiB allocated for chunks. 61.0KiB in use in bin. 3.9KiB client-requested in use in bin.
2023-08-10 07:00:13.955261: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (512): Total Chunks: 2, Chunks in use: 2. 1.0KiB allocated for chunks. 1.0KiB in use in bin. 1.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955271: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (1024): Total Chunks: 3, Chunks in use: 2. 3.8KiB allocated for chunks. 2.2KiB in use in bin. 2.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955281: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (2048): Total Chunks: 1, Chunks in use: 1. 2.0KiB allocated for chunks. 2.0KiB in use in bin. 2.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955290: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955299: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (8192): Total Chunks: 1, Chunks in use: 0. 10.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955308: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955317: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955325: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955336: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (131072): Total Chunks: 2, Chunks in use: 1. 381.5KiB allocated for chunks. 128.0KiB in use in bin. 128.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955345: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955356: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (524288): Total Chunks: 2, Chunks in use: 1. 1.55MiB allocated for chunks. 950.2KiB in use in bin. 512.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955365: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955373: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955384: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (4194304): Total Chunks: 1, Chunks in use: 1. 4.93MiB allocated for chunks. 4.93MiB in use in bin. 4.93MiB client-requested in use in bin.
2023-08-10 07:00:13.955394: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (8388608): Total Chunks: 2, Chunks in use: 2. 17.85MiB allocated for chunks. 17.85MiB in use in bin. 15.91MiB client-requested in use in bin.
2023-08-10 07:00:13.955403: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955411: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955421: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955429: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955440: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (268435456): Total Chunks: 2, Chunks in use: 1. 31.99GiB allocated for chunks. 26.29GiB in use in bin. 26.29GiB client-requested in use in bin.
2023-08-10 07:00:13.955450: I tensorflow/tsl/framework/bfc_allocator.cc:1062] Bin for 26.29GiB was 256.00MiB, Chunk State:
2023-08-10 07:00:13.955466: I tensorflow/tsl/framework/bfc_allocator.cc:1068] Size: 5.70GiB | Requested Size: 64B | in_use: 0 | bin_num: 20, prev: Size: 4.93MiB | Requested Size: 4.93MiB | in_use: 1 | bin_num: -1
2023-08-10 07:00:13.955474: I tensorflow/tsl/framework/bfc_allocator.cc:1075] Next region of size 34359738368
2023-08-10 07:00:13.955485: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7efca0000000 of size 28224000000 next 239
2023-08-10 07:00:13.955495: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f0332481000 of size 10330112 next 356
2023-08-10 07:00:13.955503: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f0332e5b000 of size 5165056 next 354
2023-08-10 07:00:13.955510: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f0333348000 of size 6120243200 next 18446744073709551615
2023-08-10 07:00:13.955518: I tensorflow/tsl/framework/bfc_allocator.cc:1075] Next region of size 8388608
2023-08-10 07:00:13.955525: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1247000000 of size 8388608 next 18446744073709551615
2023-08-10 07:00:13.955533: I tensorflow/tsl/framework/bfc_allocator.cc:1075] Next region of size 2097152
2023-08-10 07:00:13.955540: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400000 of size 256 next 1
2023-08-10 07:00:13.955548: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400100 of size 1280 next 2
2023-08-10 07:00:13.955555: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400600 of size 256 next 3
2023-08-10 07:00:13.955565: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400700 of size 256 next 4
2023-08-10 07:00:13.955572: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400800 of size 256 next 5
2023-08-10 07:00:13.955580: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400900 of size 256 next 6
2023-08-10 07:00:13.955588: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400a00 of size 256 next 7
2023-08-10 07:00:13.955596: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400b00 of size 256 next 8
2023-08-10 07:00:13.955605: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400c00 of size 256 next 9
2023-08-10 07:00:13.955613: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400d00 of size 256 next 10
......
2023-08-10 07:00:13.957322: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140eb00 of size 256 next 232
2023-08-10 07:00:13.957329: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ec00 of size 256 next 233
2023-08-10 07:00:13.957336: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ed00 of size 256 next 234
2023-08-10 07:00:13.957344: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ee00 of size 256 next 235
2023-08-10 07:00:13.957352: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ef00 of size 256 next 236
2023-08-10 07:00:13.957359: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f171140f000 of size 10240 next 240
2023-08-10 07:00:13.957366: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711411800 of size 256 next 350
2023-08-10 07:00:13.957373: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711411900 of size 256 next 353
2023-08-10 07:00:13.957381: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711411a00 of size 2048 next 351
2023-08-10 07:00:13.957388: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412200 of size 256 next 352
2023-08-10 07:00:13.957395: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412300 of size 256 next 355
2023-08-10 07:00:13.957404: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412400 of size 1024 next 357
2023-08-10 07:00:13.957412: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412800 of size 256 next 360
2023-08-10 07:00:13.957421: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412900 of size 256 next 358
2023-08-10 07:00:13.957433: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412a00 of size 512 next 362
2023-08-10 07:00:13.957442: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412c00 of size 256 next 363
2023-08-10 07:00:13.957450: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412d00 of size 256 next 359
2023-08-10 07:00:13.957651: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412e00 of size 256 next 368
2023-08-10 07:00:13.957659: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f1711412f00 of size 1536 next 369
2023-08-10 07:00:13.957667: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711413500 of size 512 next 371
2023-08-10 07:00:13.957675: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f1711413700 of size 259584 next 366
2023-08-10 07:00:13.957683: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711452d00 of size 131072 next 365
2023-08-10 07:00:13.957691: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f1711472d00 of size 653824 next 361
2023-08-10 07:00:13.957700: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711512700 of size 973056 next 18446744073709551615
2023-08-10 07:00:13.957707: I tensorflow/tsl/framework/bfc_allocator.cc:1100] Summary of in-use Chunks by size:
2023-08-10 07:00:13.957717: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 244 Chunks of size 256 totalling 61.0KiB
2023-08-10 07:00:13.957726: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 2 Chunks of size 512 totalling 1.0KiB
2023-08-10 07:00:13.957734: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 1024 totalling 1.0KiB
2023-08-10 07:00:13.957741: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 1280 totalling 1.2KiB
2023-08-10 07:00:13.957750: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 2048 totalling 2.0KiB
2023-08-10 07:00:13.957759: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 131072 totalling 128.0KiB
2023-08-10 07:00:13.957768: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 973056 totalling 950.2KiB
2023-08-10 07:00:13.957776: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 5165056 totalling 4.93MiB
2023-08-10 07:00:13.957785: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 8388608 totalling 8.00MiB
2023-08-10 07:00:13.957794: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 10330112 totalling 9.85MiB
2023-08-10 07:00:13.957802: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 28224000000 totalling 26.29GiB
2023-08-10 07:00:13.957810: I tensorflow/tsl/framework/bfc_allocator.cc:1107] Sum Total of in-use chunks: 26.31GiB
2023-08-10 07:00:13.957818: I tensorflow/tsl/framework/bfc_allocator.cc:1109] Total bytes in pool: 34370224128 memory_limit_: 39963262976 available bytes: 5593038848 curr_region_allocation_bytes_: 34359738368
2023-08-10 07:00:13.957832: I tensorflow/tsl/framework/bfc_allocator.cc:1114] Stats:
Limit: 39963262976
InUse: 28249055744
MaxInUse: 28249055744
NumAllocs: 874
MaxAllocSize: 28224000000
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2023-08-10 07:00:13.957855: W tensorflow/tsl/framework/bfc_allocator.cc:497] ***********************************************************************************________________*
2023-08-10 07:00:13.957901: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at constant_op.cc:175 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[220500000,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models/main.py", line 162, in <module>
app.run(main)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models/main.py", line 135, in main
loss = train_step(numerical_features, cat_features, labels)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/func_graph.py", line 1200, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: in user code:
File "/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models/main.py", line 129, in train_step *
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 1174, in apply_gradients **
return super().apply_gradients(grads_and_vars, name=name)
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 637, in apply_gradients
self.build(trainable_variables)
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/sgd.py", line 146, in build
self.add_variable_from_reference(
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 1106, in add_variable_from_reference
return super().add_variable_from_reference(
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 507, in add_variable_from_reference
initial_value = tf.zeros(
ResourceExhaustedError: {{function_node __wrapped__Fill_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[220500000,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Fill]
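As a quick size check on the failing allocation, here is a minimal sketch that only redoes the arithmetic from the log above. The shape, dtype, `Limit` and `InUse` values are copied from the log; the helper name `tensor_bytes` is hypothetical. If I read the dump correctly, one chunk of 28224000000 bytes is already in use, and the optimizer's `add_variable_from_reference` call then asks for a second zero-filled tensor of the same shape, so the two together cannot fit under the allocator limit.

```python
# Values copied from the log above; nothing here touches TensorFlow.
ROWS, COLS = 220_500_000, 32      # shape[220500000,32] from the OOM message
BYTES_PER_FLOAT32 = 4             # "type float" is assumed to mean float32
LIMIT = 39_963_262_976            # "Limit" from the allocator stats
IN_USE = 28_249_055_744           # "InUse" from the allocator stats
GIB = 2 ** 30


def tensor_bytes(rows, cols, itemsize=BYTES_PER_FLOAT32):
    """Size in bytes of a dense rows x cols tensor (hypothetical helper)."""
    return rows * cols * itemsize


request = tensor_bytes(ROWS, COLS)
print(f"requested tensor: {request} bytes = {request / GIB:.2f} GiB")
print(f"allocator limit:  {LIMIT / GIB:.2f} GiB")
print(f"in use + request: {(IN_USE + request) / GIB:.2f} GiB")
print("fits on this GPU:", IN_USE + request <= LIMIT)
```

The requested size matches the "rounded to 28224000000" figure in the allocator warning, and none of these numbers depend on the batch size, which would explain why `--batch_size 1` makes no difference.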