Trouble Running Resnet + Imagenet Demo

Hello, I've been trying for several days to run the ResNet + ImageNet demo at https://github.com/elasticdeeplearning/edl/tree/develop/example/demo/collective, without success. I've tried it on my local machine by pip installing `paddle_edl` and all of its requirements into a conda environment, and I've also tried the recommended Docker image:
```
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash
```

I suspect that this demo is outdated. Could it be updated or explained in more detail so that I can get it working? If the demo does still work, could someone run it using the recommended Docker image and share the exact steps to replicate it?

I've tried several different combinations of steps, but this is roughly what I'm doing.

**Reproduction Steps:**
1. Start a container from the recommended Docker image with my ImageNet dataset mounted
2. Change into `edl/example/demo/collective`
3. Set `PADDLE_EDL_IMAGENET_PATH`, `PADDLE_EDL_FLEET_CHECKPOINT_PATH` and `PADDLE_JOBSERVER`
4. Run `./start_job_server.sh`
5. Run `./start_job_client.sh`
6. Find failures in the pod logs, the worker log in each pod directory, or the client/server logs
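
For reference, steps 2 to 5 can be sketched roughly as the script below. This is only an outline of what I run: the three variable values and `DEMO_DIR` are placeholders I made up for illustration, not values confirmed by the EDL docs, and the `run` helper is hypothetical. By default it only prints the commands; setting `DRY_RUN=0` executes them.

```shell
#!/usr/bin/env bash
# Sketch of reproduction steps 2-5. All values are placeholders (assumptions),
# not settings confirmed by the EDL documentation.

# Step 3: environment variables read by the demo scripts (values hypothetical).
export PADDLE_EDL_IMAGENET_PATH="${PADDLE_EDL_IMAGENET_PATH:-/data/imagenet}"
export PADDLE_EDL_FLEET_CHECKPOINT_PATH="${PADDLE_EDL_FLEET_CHECKPOINT_PATH:-/tmp/edl_checkpoints}"
export PADDLE_JOBSERVER="${PADDLE_JOBSERVER:-http://127.0.0.1:8180}"

# Location of the demo inside the checkout (assumed relative path).
DEMO_DIR="${DEMO_DIR:-edl/example/demo/collective}"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Stop at the first command that fails so later steps do not run blindly.
reproduce() {
  run cd "$DEMO_DIR" &&            # step 2
    run ./start_job_server.sh &&   # step 4
    run ./start_job_client.sh      # step 5
}

reproduce
```

Running it as-is prints the three commands in order, which makes it easy to compare against the steps someone else uses.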

**Some issues that I have faced so far:**
- I don't know the expected format of `train.txt`, `test.txt`, or `val.txt` for the ImageNet dataset, and I get errors when using my own files with the EDL demo. What preprocessing does EDL expect for its ImageNet dataset, and how is the dataset structured, so that I can use my own copy?
- This line (https://github.com/elasticdeeplearning/edl/blob/dbe38fba9692730c5f02eae1ac46dbb2c89a8321/example/demo/collective/resnet50/package.sh#L33) should be changed to `src_dir=../../collective/resnet50`. Some directories appear to have been moved, because the current value is not the correct path to the ResNet files.
- All but one of the created pods manage to establish a connection to their desired endpoints. All the failed pods output a message such as:
```
not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075', '127.0.0.1:8077', '127.0.0.1:8079', '127.0.0.1:8081', '127.0.0.1:8083', '127.0.0.1:8085']
server not ready, wait 3 sec to retry...
```
- `nohup python -u paddle_edl.demo.collective.job_server_demo` should use the `-m` flag instead of `-u`, because `paddle_edl` cannot be found otherwise: https://github.com/elasticdeeplearning/edl/blob/dbe38fba9692730c5f02eae1ac46dbb2c89a8321/example/demo/collective/start_job_server.sh#L26
- The same applies to the client bash file: https://github.com/elasticdeeplearning/edl/blob/dbe38fba9692730c5f02eae1ac46dbb2c89a8321/example/demo/collective/start_job_client.sh#L33
- These are only a subset of the issues I've had to debug to get this far.
- I've also tried running the MNIST tutorial (https://github.com/elasticdeeplearning/edl/blob/develop/doc/boss_tutorial.md), also without success.
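
To illustrate the `-u` versus `-m` point above without touching the EDL scripts, here is a small standalone sketch. It builds a throwaway package (the name `demo_pkg` is hypothetical) and shows that `python -u` treats a dotted name as a file path, while `-m` resolves it as a module. It assumes `python3` is on the `PATH`. Note that `-u` (unbuffered output) can be combined with `-m`, which may be preferable under `nohup`.

```shell
#!/usr/bin/env bash
# Build a throwaway package to compare how `-u` and `-m` treat a dotted name.
# `demo_pkg` is a hypothetical name used only for this illustration.
workdir="$(mktemp -d)"
mkdir -p "$workdir/demo_pkg"
touch "$workdir/demo_pkg/__init__.py"
echo 'print("module ran")' > "$workdir/demo_pkg/job.py"

# -u only unbuffers output; the dotted name is then looked up as a script file,
# which does not exist, so the interpreter exits with an error.
if (cd "$workdir" && python3 -u demo_pkg.job) 2>/dev/null; then
  u_result="ran"
else
  u_result="failed"
fi

# -m resolves the dotted name as an importable module; -u can be kept with it.
m_result="$(cd "$workdir" && python3 -u -m demo_pkg.job)"

echo "python -u demo_pkg.job    -> $u_result"
echo "python -u -m demo_pkg.job -> $m_result"

# Remove the temporary package.
rm -rf "$workdir"
```

The first invocation fails and the second prints `module ran`, which matches the behaviour I see with the demo scripts.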

**System information:**  
  - PaddlePaddle version: I have tried with v1.8.5 locally and whatever version is packaged into the docker image
  - EDL version: I have tried with v0.3.1 locally and whatever version is packaged into the docker image
  - GPU: Tesla M60 with CUDA 9.0 and cuDNN 7.0
  - OS Platform: Ubuntu 16.04.6 LTS
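
Since I don't know the exact versions packaged into the Docker image, a sketch like the following could be run inside the container to pin them down. The function name `report_versions` is my own, and each probe falls back to a short message if the tool or package is missing.

```shell
#!/usr/bin/env bash
# Print version details for the current environment. Every probe is guarded,
# so the function still completes when a tool or package is absent.
report_versions() {
  if command -v python >/dev/null 2>&1; then
    # PaddlePaddle exposes its version as paddle.__version__.
    python -c 'import paddle; print("paddle", paddle.__version__)' 2>/dev/null \
      || echo "paddle: not importable"
    # pip show prints package metadata, including a "Version:" line.
    python -m pip show paddle_edl 2>/dev/null | grep -i '^Version:' \
      || echo "paddle_edl: not installed via pip"
  else
    echo "python: not found"
  fi

  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
      || echo "nvidia-smi: query failed"
  else
    echo "nvidia-smi: not found"
  fi
}

report_versions
```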

Thanks and looking forward to demoing the project!
