
Add distributed candidate evaluation support #52

Open
picarus opened this issue Dec 21, 2018 · 3 comments
Labels: enhancement (New feature or request)

Comments

@picarus

picarus commented Dec 21, 2018

Hello,
I am running AdaNet 0.5.0 on GCP with runtime version 1.10.
I am using a CPU configuration with multiple nodes.
The training phase is very fast, but it gets completely slowed down by the evaluations. The evaluations don't seem to take advantage of the multiple nodes, and the logs are flooded with "Waiting for chief to finish" messages coming from the workers, generated by the AdaNet Estimator.
I think support for the evaluation phase to use the multiple nodes should be added, and that it should be a priority change: not only are the nodes unused, you also keep paying for them.
Is that feasible?
Thanks in advance,
Jose

@cweill
Contributor

cweill commented Dec 21, 2018

@picarus: This is a known issue when using the adanet.Evaluator in distributed training.

One way you can make the evaluation much faster is to pass the steps argument to its constructor (e.g. steps=100). This will end evaluation after n batches instead of evaluating over the full dataset. Alternatively, if you do not pass an Evaluator to the adanet.Estimator, the Estimator will use a moving average of the train loss to determine the best candidate and skip evaluation altogether. This option is fine if you have only a single candidate per iteration.
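For example, a minimal sketch of the first option (eval_input_fn, head, subnetwork_generator, and the step count are placeholders for your own setup):

```python
import adanet

# Cap candidate evaluation at 100 batches instead of the full dataset.
evaluator = adanet.Evaluator(
    input_fn=eval_input_fn,  # your existing evaluation input_fn
    steps=100)

estimator = adanet.Estimator(
    head=head,
    subnetwork_generator=subnetwork_generator,
    max_iteration_steps=1000,  # placeholder value
    evaluator=evaluator)
```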

You're right that distributed evaluation should be a supported feature; unfortunately it is non-trivial to implement. Do you have any suggestions for how to shard evaluation across all the workers given an arbitrary input_fn?

cweill added the enhancement label on Dec 21, 2018
cweill changed the title from "Waiting for chief to finish" to "Add distributed candidate evaluation support" on Dec 21, 2018
@picarus
Author

picarus commented Jan 7, 2019

@cweill, I lack the deep knowledge you surely have about AdaNet, or even TF, but unless you are suggesting that the difficulty of implementing this lies in TF, I don't see additional complexity beyond the fact that you are evaluating multiple networks. Is it a TF issue?

@cweill
Contributor

cweill commented Jan 7, 2019

@picarus: Unfortunately nothing is very straightforward in TF. :)

The challenges I see are:

  • Making sure this works for any number of workers and candidate subnetworks.
  • Synchronizing the workers so they don't look at the same data; otherwise you may compute incorrect metrics when evaluating. This is easy to do on one worker, but it's not obvious to me how to do it across multiple servers (see the sketch below).

If you have any suggestions or a pull request, I'm happy to chat more.
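For concreteness, here's a rough sketch of what sharding could look like when the input_fn happens to return a tf.data.Dataset. make_sharded_input_fn and base_input_fn are hypothetical names, not part of the AdaNet API, and this assumes the cluster topology is available in the TF_CONFIG environment variable:

```python
import json
import os


def make_sharded_input_fn(base_input_fn):
  """Wraps an input_fn so each worker evaluates a disjoint shard.

  Assumes base_input_fn takes no arguments and returns a tf.data.Dataset.
  """

  def sharded_input_fn():
    # Read the worker topology from TF_CONFIG, as set by GCP/ML Engine.
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    num_workers = len(tf_config.get("cluster", {}).get("worker", [])) or 1
    worker_index = tf_config.get("task", {}).get("index", 0)
    dataset = base_input_fn()
    # Each worker sees every num_workers-th example, so shards are disjoint.
    return dataset.shard(num_workers, worker_index)

  return sharded_input_fn
```

The open question is how to do something equivalent for an arbitrary input_fn that doesn't return a Dataset, and how to aggregate per-shard metrics correctly across workers.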
