Description
Hello, I've been trying to run the resnet + imagenet demo shown here https://github.com/elasticdeeplearning/edl/tree/develop/example/demo/collective for several days now, but with no success. I've tried doing this on my local machine by pip installing `paddle_edl` and all associated requirements into a conda environment, and I've also tried the recommended docker image:
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash
I suspect that this demo is outdated, and I was wondering if it could be updated or explained in more detail so that I can get it working. If the demo does in fact still work, could someone run it using the expected docker image and let me know the exact steps to replicate it?
I've tried running the demo using several different combinations of steps, but here is what I'm doing in general.
Reproduction Steps:
- Enter the recommended docker image and mount my imagenet dataset.
- Enter `edl/example/demo/collective`
- Set `PADDLE_EDL_IMAGENET_PATH`, `PADDLE_EDL_FLEET_CHECKPOINT_PATH` and `PADDLE_JOBSERVER`
- Run `./start_job_server.sh`
- Run `./start_job_client.sh`
- Find failures in either the pod logs, the worker log in each pod directory, or the client/server logs
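For reference, the steps above correspond roughly to the following sketch, run from the directory that contains the `edl` checkout inside the container. The two paths and the job server address are placeholder assumptions, not values confirmed by the demo's documentation, and the scripts are only invoked if the demo directory actually exists.

```shell
# Placeholder values (assumptions): replace with your own dataset path,
# checkpoint directory, and job server address.
export PADDLE_EDL_IMAGENET_PATH="${PADDLE_EDL_IMAGENET_PATH:-/data/imagenet}"
export PADDLE_EDL_FLEET_CHECKPOINT_PATH="${PADDLE_EDL_FLEET_CHECKPOINT_PATH:-/tmp/edl_checkpoints}"
export PADDLE_JOBSERVER="${PADDLE_JOBSERVER:-http://127.0.0.1:8180}"

demo_dir="edl/example/demo/collective"

if [ -d "$demo_dir" ]; then
  # Run in a subshell so the caller's working directory is left unchanged.
  (
    cd "$demo_dir" || exit 1
    ./start_job_server.sh   # start the job server first
    ./start_job_client.sh   # then start the client
  )
else
  echo "demo directory not found: $demo_dir" >&2
fi
```

Exporting the variables before calling the scripts makes them visible to the Python processes the scripts start.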
Some issues that I have faced so far:
- I don't know what the specifications are for `train.txt`, `test.txt` or `val.txt` for the imagenet dataset, and I get errors when using mine with the edl demo. What preprocessing does edl expect for its `imagenet` dataset, and how is it structured, so that I can use my own `imagenet` dataset?
- This line ( ) should be changed to `src_dir=../../collective/resnet50`. Some directories must have been moved around, as the current value is not the correct path to the resnet files.
- All but one of the created pods manage to establish a connection to their desired endpoints. All the failed pods output a message such as:
not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075', '127.0.0.1:8077', '127.0.0.1:8079', '127.0.0.1:8081', '127.0.0.1:8083', '127.0.0.1:8085']
server not ready, wait 3 sec to retry...
- `nohup python -u paddle_edl.demo.collective.job_server_demo` should use an `-m` flag instead of `-u`, as `paddle_edl` cannot be found otherwise
- Same as the above point but for the client bash file
- These are a subset of the total issues I've had to debug to come this far
- I've also tried running the MNIST tutorial, with no luck (https://github.com/elasticdeeplearning/edl/blob/develop/doc/boss_tutorial.md)
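On the `-m` versus `-u` point above: `-u` only disables output buffering, so Python still treats the next argument as a script path, while `-m` looks the name up as a module. The sketch below illustrates the difference using the standard-library module `json.tool` as a stand-in, on the assumption that `python3` is on the `PATH`; it does not touch `paddle_edl` itself. The two flags can be combined (`python -u -m ...`) if unbuffered output is still wanted.

```shell
# With -u alone, "json.tool" is interpreted as a file name in the current
# directory, which normally does not exist, so the command fails.
if python3 -u json.tool </dev/null >/dev/null 2>&1; then
  u_status=ok
else
  u_status=failed
fi

# With -m, Python resolves "json.tool" as a module on sys.path and runs it.
if echo '{}' | python3 -u -m json.tool >/dev/null 2>&1; then
  m_status=ok
else
  m_status=failed
fi

echo "-u alone: $u_status, -u -m: $m_status"
```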
System information:
- PaddlePaddle version: I have tried with v1.8.5 locally and whatever version is packaged into the docker image
- EDL version: I have tried with v0.3.1 locally and whatever version is packaged into the docker image
- GPU: Tesla M60 with CUDA 9.0 and CUDNN 7.0
- OS Platform: Ubuntu 16.04.6 LTS
Thanks and looking forward to demoing the project!