Hello, I've been trying for several days to run the ResNet + ImageNet demo at https://github.com/elasticdeeplearning/edl/tree/develop/example/demo/collective, without success. I've tried it on my local machine, by pip installing paddle_edl and all of its requirements into a conda environment, and also with the recommended docker image:
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash
I suspect that this demo is outdated. Could it be updated, or explained in more detail, so that I can get it working? If the demo does still work, could someone run it using the expected docker image and tell me the exact steps to replicate it?
I've tried several different combinations of steps, but this is what I'm doing in general.
Reproduction Steps:
1. Start a container from the recommended docker image and mount my ImageNet dataset.
2. Enter edl/example/demo/collective.
3. Set PADDLE_EDL_IMAGENET_PATH, PADDLE_EDL_FLEET_CHECKPOINT_PATH and PADDLE_JOBSERVER.
4. Run ./start_job_server.sh.
5. Run ./start_job_client.sh.
6. Find failures in the pod logs, the worker log in each pod directory, or the client/server logs.
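Before step 4 it can help to confirm that the three variables from step 3 are actually visible to the process that launches the scripts. The following is a minimal sketch of such a preflight check. The helper name `missing_env_vars` is hypothetical and not part of paddle_edl; only the variable names come from the steps above.

```python
import os

# Variable names are taken from the reproduction steps; the helper itself is
# a hypothetical illustration and is not provided by paddle_edl.
REQUIRED_VARS = (
    "PADDLE_EDL_IMAGENET_PATH",
    "PADDLE_EDL_FLEET_CHECKPOINT_PATH",
    "PADDLE_JOBSERVER",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all demo variables are set")
```

Running this inside the container, in the same shell that will run ./start_job_server.sh, rules out the case where a variable was set but not exported.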
Some issues that I have faced so far:
- I don't know the expected format of train.txt, test.txt or val.txt for the ImageNet dataset, and I get errors when using mine with the edl demo. What preprocessing does edl expect for its ImageNet dataset, and how should the dataset be structured so that I can use my own copy?
- The src_dir line in edl/example/demo/collective/resnet50/package.sh should be changed to src_dir=../../collective/resnet50. Some directories must have been moved around, as the current value is not the correct path to the resnet files.
- All but one of the created pods manage to establish a connection to their desired endpoint. All the failed pods output a message such as:
  not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075', '127.0.0.1:8077', '127.0.0.1:8079', '127.0.0.1:8081', '127.0.0.1:8083', '127.0.0.1:8085']
  server not ready, wait 3 sec to retry...
- nohup python -u paddle_edl.demo.collective.job_server_demo should use the -m flag instead of -u, as paddle_edl cannot be found otherwise.
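To illustrate the -u versus -m point: -u only makes Python's output unbuffered, so a dotted name that follows it is treated as a script filename, whereas -m looks the name up as a module. The sketch below reproduces this with a throwaway package. `demo_pkg` is a hypothetical stand-in and is not the real paddle_edl layout.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run a throwaway module with -u, with -m, and with both flags."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical stand-in for paddle_edl
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "job.py").write_text("print('started')\n")

        def run(flags):
            # cwd=tmp puts demo_pkg on the module search path when -m is used.
            return subprocess.run(
                [sys.executable, *flags, "demo_pkg.job"],
                cwd=tmp,
                capture_output=True,
                text=True,
            )

        return run(["-u"]), run(["-m"]), run(["-u", "-m"])


if __name__ == "__main__":
    for label, result in zip(("-u", "-m", "-u -m"), run_both_ways()):
        print(label, "exit code:", result.returncode)
```

The two flags are independent, so `python -u -m <module>` keeps the unbuffered logging that the nohup launch presumably wanted while still resolving the module.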
Referenced code (commit dbe38fb):
- edl/example/demo/collective/resnet50/package.sh, line 33
- edl/example/demo/collective/start_job_server.sh, line 26
- edl/example/demo/collective/start_job_client.sh, line 33
System information:
Thanks and looking forward to demoing the project!