Multi-GPU training raises an error #5

Open
vcbeaut opened this issue Aug 11, 2021 · 5 comments

Comments

vcbeaut commented Aug 11, 2021

Running the repo owner's code on multiple GPUs raises an error: tf.split fails when distributing the data.

@HuiResearch (Owner)

Could you tell me which code you mean? I'll look into it on my side.

vcbeaut commented Aug 13, 2021

There are only a few tf.split calls in the code; the failing one is in the data-distribution step.

@HuiResearch (Owner)

Hi, I tested it and it runs fine on both two and three GPUs. What error are you getting? Could you post a screenshot?

pgr2015 commented Sep 9, 2021

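Hi, here is the screenshot. I'm training on two GPUs, and split fails when distributing the data. It is probably because the data in the last step cannot be divided evenly.
[Screenshot: 微信图片_20210909162149]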
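To illustrate the suspected cause: `tf.split` with an integer number of splits requires the split dimension to be divisible by that number. The sketch below is a plain-Python stand-in for that rule, not the repository's code; `split_evenly` and the numbers (11 examples, batch size 4, 2 GPUs) are hypothetical and chosen only to reproduce an undersized last batch.

```python
def split_evenly(batch, num_devices):
    """Split a batch into equal shards, one per device.

    Hypothetical stand-in for an integer tf.split: it refuses batches
    whose size is not divisible by the number of devices.
    """
    if len(batch) % num_devices != 0:
        raise ValueError(
            f"batch of {len(batch)} examples does not divide "
            f"across {num_devices} devices"
        )
    shard = len(batch) // num_devices
    return [batch[i * shard:(i + 1) * shard] for i in range(num_devices)]


# Assumed numbers for illustration: 11 examples, batch_size 4, 2 GPUs.
examples = list(range(11))
batches = [examples[i:i + 4] for i in range(0, len(examples), 4)]

results = []
for batch in batches:
    try:
        # Record the shard sizes each device would receive.
        results.append([len(s) for s in split_evenly(batch, 2)])
    except ValueError as err:
        # The last batch has only 3 examples and cannot be halved.
        results.append(str(err))

print(results)
```

Under these assumptions the first two batches split into shards of 2 and 2, and only the final batch of 3 raises, which matches the behaviour described above.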

pgr2015 commented Sep 9, 2021

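Hi, I've solved this problem. If the last step has fewer than batch_size examples, things can go wrong: for example, if the size of that last batch happens to be odd, splitting it across two devices fails. The fix is simple: just change this to True here. That said, I personally think the data-distribution logic still has some issues. In the test phase the data should not be distributed across multiple GPUs; one GPU should run the test while the others do nothing. @huanghuidmml what do you think?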
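The link target behind "here" is not preserved in this thread. One plausible candidate is the `drop_remainder` argument of `tf.data.Dataset.batch`, but that is an assumption. The plain-Python sketch below mimics what such a flag does; `make_batches` and the numbers are hypothetical, not taken from the repository.

```python
def make_batches(examples, batch_size, drop_remainder=False):
    """Group examples into batches.

    With drop_remainder=True, a final batch smaller than batch_size is
    discarded, similar in spirit to tf.data.Dataset.batch.
    """
    batches = [
        examples[i:i + batch_size]
        for i in range(0, len(examples), batch_size)
    ]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


# Assumed numbers for illustration: 11 examples, batch_size 4, 2 GPUs.
num_gpus = 2
examples = list(range(11))

all_batches = make_batches(examples, 4, drop_remainder=False)
kept = make_batches(examples, 4, drop_remainder=True)

# Every kept batch has exactly batch_size examples, so it splits evenly
# as long as batch_size itself is a multiple of the GPU count.
splittable = all(len(b) % num_gpus == 0 for b in kept)

# The price: examples in the discarded batch are never seen.
dropped = len(examples) - sum(len(b) for b in kept)

print(len(all_batches), len(kept), splittable, dropped)
```

Dropping the remainder is usually harmless during training, but during evaluation it silently skips examples, which is one reason to treat the test phase differently, as suggested above.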

3 participants