Describe the bug:
I found that the evaluator node stops working after some time, while the training nodes keep working fine and the cluster as a whole doesn't crash. Training runs for 80000 steps in total, but the evaluator only evaluates for 10000+ steps. After that, it outputs no more logs.
I don't see anything obvious in your logs. Since it looks like the evaluator process stalled or quit, I'd check CPU and memory usage on that node while it's running to get more clues. You can also try running the TF cluster at a smaller scale on a single node without Spark, by launching the code in separate processes configured through TF_CONFIG, i.e. using distributed TF by itself. With local processes, it should be a bit easier to debug the evaluator node and see why it may be stalling.
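To make the local-process suggestion concrete, here is a minimal sketch of how the per-process TF_CONFIG values could be generated and handed to separate processes. It assumes an Estimator-style `train_and_evaluate` setup, where the evaluator is not listed in the cluster spec and only appears as the task type. The helper names, the port numbers, and the `train.py` script mentioned afterwards are hypothetical and not taken from the issue.

```python
import json
import os
import subprocess
import sys


def build_cluster(num_workers=1, num_ps=1, base_port=2222):
    """Return a cluster dict with every task on localhost, one port per task."""
    ports = iter(range(base_port, base_port + 1 + num_workers + num_ps))
    cluster = {"chief": [f"localhost:{next(ports)}"]}
    if num_workers:
        cluster["worker"] = [f"localhost:{next(ports)}" for _ in range(num_workers)]
    if num_ps:
        cluster["ps"] = [f"localhost:{next(ports)}" for _ in range(num_ps)]
    return cluster


def tf_config_for(cluster, task_type, index=0):
    """Serialize the TF_CONFIG JSON string for a single task."""
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": index}})


def launch_all(script, cluster, with_evaluator=True):
    """Start one local process per task, each with its own TF_CONFIG and log file."""
    tasks = [(t, i) for t, addrs in cluster.items() for i in range(len(addrs))]
    if with_evaluator:
        # Assumption: the evaluator is not part of the training cluster spec,
        # so it is added here only as an extra task.
        tasks.append(("evaluator", 0))
    procs = {}
    for task_type, index in tasks:
        env = dict(os.environ, TF_CONFIG=tf_config_for(cluster, task_type, index))
        log = open(f"{task_type}_{index}.log", "w")
        procs[(task_type, index)] = subprocess.Popen(
            [sys.executable, script], env=env,
            stdout=log, stderr=subprocess.STDOUT)
    return procs
```

Something like `launch_all("train.py", build_cluster(num_workers=2))` would then start every role on one machine, and `evaluator_0.log` can be watched (or the evaluator run alone under a debugger with the same TF_CONFIG) to see where it stops.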