Evaluator hangs while training #589

jiqiujia · 2022-08-02T04:01:28Z

Environment:

Python version 3.7
Spark version 2.4
TensorFlow version 2.5
TensorFlowOnSpark version 2.2.3
Cluster version hadoop

Describe the bug:
I found that the evaluator node stops working after some time, while the training nodes keep working fine and the cluster as a whole does not crash. Training runs for 80000 steps in total, but the evaluator only evaluates up to step 10000+. After that, it outputs no more logs.

leewyang · 2022-08-02T16:44:36Z

I don't see anything obvious in your logs. Given that it looks like the evaluator process stalled or quit, I'd check CPU and memory usage on that node (while it's running) to get more clues. You can also try running the TF cluster at a smaller scale on a single node without Spark, by running the code in separate processes using TF_CONFIG, i.e. using distributed TF by itself. With local processes, you should be able to debug the evaluator node a bit more easily and see why it may be stalling.
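For reference, a minimal sketch of the local, Spark-free setup described above. It assumes an Estimator-style cluster with `chief`, `worker`, and `ps` roles plus a separate evaluator. The ports, the helper names (`make_tf_config`, `launch`, `launch_all`), and the training script path are hypothetical and would need to match your own job.

```python
import json
import os
import subprocess
import sys

# Hypothetical local cluster layout; ports are arbitrary free ports on this machine.
CLUSTER = {
    "chief": ["localhost:2222"],
    "worker": ["localhost:2223"],
    "ps": ["localhost:2224"],
}


def make_tf_config(task_type, task_index, cluster=CLUSTER):
    """Build the TF_CONFIG JSON string for one process in the cluster.

    The evaluator is not listed in the cluster spec; it is identified
    only through the "task" entry.
    """
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


def launch(script, task_type, task_index):
    """Start one role as a local subprocess with its own TF_CONFIG."""
    env = dict(os.environ, TF_CONFIG=make_tf_config(task_type, task_index))
    return subprocess.Popen([sys.executable, script], env=env)


def launch_all(script):
    """Start every role in CLUSTER plus a single evaluator process."""
    roles = [(t, i) for t, addrs in CLUSTER.items() for i in range(len(addrs))]
    roles.append(("evaluator", 0))
    return {f"{t}:{i}": launch(script, t, i) for t, i in roles}
```

Calling `launch_all("train.py")` (with `train.py` standing in for your own training script) would start four local processes. Since the evaluator is then an ordinary local process, you can watch its CPU and memory usage directly or attach a debugger when it stops logging.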
