
Add distributed candidate evaluation support #52

Open
picarus opened this issue Dec 21, 2018 · 3 comments
Labels: enhancement (New feature or request)

Comments

@picarus

picarus commented Dec 21, 2018

Hello,
I am running AdaNet 0.5.0 on GCP with runtime version 1.10.
I am using a CPU configuration with multiple nodes.
The training phase is very fast, but it gets completely slowed down by the evaluations. The evaluations don't seem to take advantage of the multiple nodes, and the logs are flooded with "Waiting for chief to finish" messages coming from the workers, generated by the AdaNet Estimator.
I think support for the evaluation phase to use the multiple nodes should be added, and that it should be a priority change: not only are the nodes unused, you also keep paying for them.
Is that feasible?
Thanks in advance,
Jose

@cweill
Contributor

cweill commented Dec 21, 2018

@picarus: This is a known issue when using the adanet.Evaluator in distributed training.

One way you can make the evaluation much faster is to pass the steps argument to its constructor (e.g. steps=100). This will end evaluation after n batches instead of evaluating over the full dataset. Alternatively, if you do not pass an Evaluator to the adanet.Estimator, the Estimator will use a moving average of the train loss to determine the best candidate and skip evaluation altogether. This option is fine if you have only a single candidate per iteration.
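For example, a minimal sketch of the first option (eval_input_fn, head, subnetwork_generator, and the step count are placeholders for your own setup):

```python
import adanet

# Cap candidate evaluation at 100 batches instead of the full dataset.
evaluator = adanet.Evaluator(
    input_fn=eval_input_fn,  # your existing evaluation input_fn
    steps=100)

estimator = adanet.Estimator(
    head=head,
    subnetwork_generator=subnetwork_generator,
    max_iteration_steps=1000,  # placeholder value
    evaluator=evaluator)
```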

You're right that distributed evaluation should be a supported feature; unfortunately it is non-trivial to implement. Do you have any suggestions for how to shard evaluation across all the workers given an arbitrary input_fn?

cweill added the enhancement label on Dec 21, 2018
cweill changed the title from "Waiting for chief to finish" to "Add distributed candidate evaluation support" on Dec 21, 2018
@picarus
Author

picarus commented Jan 7, 2019

@cweill, I lack the deep knowledge you surely have about AdaNet, or even TF, but unless you are suggesting that the difficulty of implementing this lies in TF, I don't see additional complexity beyond the fact that you are evaluating multiple networks. Is it a TF issue?

@cweill
Contributor

cweill commented Jan 7, 2019

@picarus: Unfortunately nothing is very straightforward in TF. :)

The challenges I see are:

  • Making sure this works for any number of workers and candidate subnetworks.
  • Synchronizing the workers so they don't look at the same data; otherwise you may compute incorrect metrics when evaluating. This is easy to do on one worker, but it's not obvious to me how to do it across multiple servers (see the sketch below).

If you have any suggestions or a pull request, I'm happy to chat more.
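For concreteness, here's a rough sketch of what sharding could look like when the input_fn happens to return a tf.data.Dataset. make_sharded_input_fn and base_input_fn are hypothetical names, not part of the AdaNet API, and this assumes the cluster topology is available in the TF_CONFIG environment variable:

```python
import json
import os


def make_sharded_input_fn(base_input_fn):
  """Wraps an input_fn so each worker evaluates a disjoint shard.

  Assumes base_input_fn takes no arguments and returns a tf.data.Dataset.
  """

  def sharded_input_fn():
    # Read the worker topology from TF_CONFIG, as set by GCP/ML Engine.
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    num_workers = len(tf_config.get("cluster", {}).get("worker", [])) or 1
    worker_index = tf_config.get("task", {}).get("index", 0)
    dataset = base_input_fn()
    # Each worker sees every num_workers-th example, so shards are disjoint.
    return dataset.shard(num_workers, worker_index)

  return sharded_input_fn
```

The open question is how to do something equivalent for an arbitrary input_fn that doesn't return a Dataset, and how to aggregate per-shard metrics correctly across workers.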
