Trouble Running Resnet + Imagenet Demo #154

Open
aleksficek opened this issue Oct 4, 2020 · 0 comments

Hello, I've been trying to run the ResNet + ImageNet demo at https://github.com/elasticdeeplearning/edl/tree/develop/example/demo/collective for several days now, without success. I've tried it on my local machine, with paddle_edl and all of its requirements pip-installed into a conda environment, and also with the recommended Docker image:

docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash

I suspect that this demo is outdated. Could it be updated, or explained in more detail, so that I can get it working? If the demo does still work, could someone run it with the recommended Docker image and share the exact steps to reproduce it?

I've tried several different combinations of steps, but here is what I'm doing in general.

Reproduction Steps:

  1. Start a container from the recommended Docker image with my ImageNet dataset mounted.
  2. Enter edl/example/demo/collective
  3. Set PADDLE_EDL_IMAGENET_PATH, PADDLE_EDL_FLEET_CHECKPOINT_PATH and PADDLE_JOBSERVER
  4. Run ./start_job_server.sh
  5. Run ./start_job_client.sh
  6. Find failures in the pod logs, the worker log in each pod directory, or the client/server logs
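
To rule out a missing variable in step 3, a small pre-flight check can be run before the two start scripts. This is only a sketch: the function name `check_demo_env` is my own, and the export values in the comments are placeholders, since I don't know what values or formats the demo actually expects.

```shell
#!/bin/sh
# Sketch: fail early if any variable from step 3 is unset or empty.
# check_demo_env is a hypothetical helper, not part of EDL.
check_demo_env() {
    missing=0
    for name in PADDLE_EDL_IMAGENET_PATH PADDLE_EDL_FLEET_CHECKPOINT_PATH PADDLE_JOBSERVER; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (all values below are placeholders, not known-good settings):
#   export PADDLE_EDL_IMAGENET_PATH=/path/to/imagenet
#   export PADDLE_EDL_FLEET_CHECKPOINT_PATH=/path/to/checkpoints
#   export PADDLE_JOBSERVER=<job server address>
#   check_demo_env && ./start_job_server.sh
```

The check only confirms that the variables are non-empty; it cannot tell whether the paths or the job server address are what the demo scripts need.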

Some issues that I have faced so far:

  • I don't know the expected format of train.txt, test.txt or val.txt for the ImageNet dataset, and I get errors when using mine with the EDL demo. What preprocessing does EDL expect for its ImageNet dataset, and how is the dataset structured, so that I can use my own?
  • This line (src_dir=../../../collective/resnet50) should be changed to src_dir=../../collective/resnet50. Some directories must have been moved, as the current path does not lead to the ResNet files.
  • All but one of the created pods manage to establish a connection to their desired endpoints. Every pod that fails outputs a message such as:
not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075', '127.0.0.1:8077', '127.0.0.1:8079', '127.0.0.1:8081', '127.0.0.1:8083', '127.0.0.1:8085']
server not ready, wait 3 sec to retry...
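
To narrow down the endpoint problem, one option is to check which of the listed ports actually accept TCP connections from inside the container. The sketch below is my own helper code, not part of EDL. It assumes `grep -oE`, `timeout` and `bash` are available in the image, and it only tests whether a port is listening, not whether the EDL server behind it is healthy.

```shell
#!/bin/sh
# Sketch: report which endpoints from a "not ready endpoints" log line
# accept TCP connections. All function names here are hypothetical.

# Extract IPv4 host:port pairs, one per line, from text on stdin.
parse_endpoints() {
    grep -oE '[0-9]+(\.[0-9]+){3}:[0-9]+'
}

# Return 0 if a TCP connection to host $1, port $2 opens within 2 seconds.
# Uses bash's /dev/tcp redirection, so bash must be installed.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Print "host:port open" or "host:port closed" for each endpoint on stdin.
report_endpoints() {
    parse_endpoints | while IFS=: read -r host port; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

For example, piping the log line above into it, `echo "not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075']" | report_endpoints`, prints one line per endpoint. Because the addresses are 127.0.0.1, it has to run in the same container (or on the same host) as the pods.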

System information:

  • PaddlePaddle version: I have tried with v1.8.5 locally and whatever version is packaged into the docker image
  • EDL version: I have tried with v0.3.1 locally and whatever version is packaged into the docker image
  • GPU: Tesla M60 with CUDA 9.0 and CUDNN 7.0
  • OS Platform: Ubuntu 16.04.6 LTS
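
Since I don't know which versions the Docker image ships, something like the following could be run inside the container to fill in the gaps. It is a sketch with a function name of my own choosing (`report_versions`); it assumes `python` and `pip` point at the environment the demo uses, and it prints `unknown` when a tool is missing.

```shell
#!/bin/sh
# Sketch: print the versions asked for in the system information list.
# report_versions is a hypothetical helper, not part of EDL.
report_versions() {
    paddle_version=$(python -c 'import paddle; print(paddle.__version__)' 2>/dev/null)
    echo "PaddlePaddle: ${paddle_version:-unknown}"

    edl_version=$(pip show paddle_edl 2>/dev/null | sed -n 's/^Version: //p')
    echo "EDL: ${edl_version:-unknown}"

    gpu_info=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null)
    echo "GPU: ${gpu_info:-unknown}"

    echo "OS: $(uname -sr)"
}

report_versions
```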

Thanks and looking forward to demoing the project!
