Describe the bug:
I have two issues regarding TensorBoard when running a training process for my model on 2 worker nodes:
1. The first is that after the training process completes, the TensorBoard files are deleted immediately on worker 1 while they are kept on worker 0, even though I can use TensorBoard to check details while the training process is running.
2. The second is that I am trying to profile my model in the Profiler page to check the time consumed for batches 3 to 5 during training, but I get 0 ms for communication time, more specifically the Device Collective Communication and Device to Device Time. However, the Average Step Time shows reasonable values like 19368.9 ms!
From the Hosts drop-down list I can see that there is only one detected host in the cluster, not 2. Why does this happen?
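(For context, batch-range profiling like this is usually enabled through something like the Keras TensorBoard callback's profile_batch argument; the sketch below is only an illustration of that kind of setup, with an assumed log path, not the actual training script from this issue.)

import tensorflow as tf

# Illustrative only: profile batches 3 through 5 and write the
# event/profile files to a per-worker log directory (path is assumed).
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="/tmp/tb_logs/worker0",
    profile_batch="3,5",
)
# model.fit(dataset, epochs=1, callbacks=[tb_callback])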
Logs:
If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.
When using the "built-in" TensorBoard server in TFoS (triggered by supplying tensorboard=True), the TB server is hosted in the "chief" worker, so it has the same lifecycle as the "chief" worker. That is, it will be killed when the Spark job completes. If you want visibility after the job completes, you can write the TB events to the shared/distributed filesystem and then spawn your own TB process pointing to this location.
This sounds like more of a question for the TensorFlow team, since TFoS has nothing to do with these metrics. Regardless, I'm assuming that your environment somehow isn't set up to capture this information. For example, I'm guessing that "Device Collective Communication Time" refers to something like NCCL, which you may not have (or may not have enabled) in your setup.
There are no GPUs in the cluster, so the worker nodes rely only on CPUs to process the data. As I understand from your answer, the Device Collective Communication time is limited to GPUs and NCCL. Isn't there any way to capture this value while using only CPUs?
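(For reference, on a CPU-only cluster the cross-worker collectives in tf.distribute typically run over the ring/gRPC implementation rather than NCCL; the sketch below shows how that implementation can be selected explicitly with the TF 2.4+ API. Whether the profiler then attributes time to Device Collective Communication is a separate question, and this wiring is an assumption, not the script from this issue.)

import tensorflow as tf

# Sketch (TF 2.4+ API): explicitly select the ring/gRPC collective
# implementation, which CPU-only clusters use instead of NCCL.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)

with strategy.scope():
    # build and compile the model here
    pass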
Environment:
Spark Submit Command Line:
spark-submit --master spark://master:7077 train_file.py --cluster_size 2 --epochs 1