
Multi-GPU training #101

Open
sususama opened this issue Dec 1, 2023 · 2 comments

Comments


sususama commented Dec 1, 2023

How can I train on multiple GPUs? I see that distributed.py only configures a single card; when I set it to [0,1,2,3], it immediately fails with:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
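This error means a torch.distributed collective was invoked before the default process group existed. A minimal sketch of initializing it, using the CPU "gloo" backend with a single process purely for illustration (the repo's actual launcher and backend may differ):

```python
import os
import torch.distributed as dist

# Initialize the default process group before any distributed call.
# MASTER_ADDR / MASTER_PORT tell the rendezvous where to meet; with a
# real multi-GPU launch these are usually set by torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

initialized = dist.is_initialized()
print(initialized)

dist.destroy_process_group()
```

With a real multi-process setup, each process would be started with its own rank and a shared world_size, typically via torchrun.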


hfj-cc commented Dec 11, 2023

As I understand it, distributed.py only configures the number of worker threads; multi-GPU training is actually already enabled by dist_model = nn.DataParallel(model).cuda().
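To illustrate the point above: nn.DataParallel replicates a module on every visible GPU and splits each input batch across them along dim 0, with no init_process_group needed. A minimal sketch using a stand-in linear layer (not the repo's model), with a CPU fallback so it runs anywhere:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the repo's model (assumption)

if torch.cuda.device_count() > 1:
    # Replicates the module on each visible GPU; each forward pass
    # scatters the batch across cards and gathers the outputs.
    dist_model = nn.DataParallel(model).cuda()
    inputs = torch.randn(4, 8).cuda()
else:
    dist_model = model  # CPU / single-GPU fallback for illustration
    inputs = torch.randn(4, 8)

out = dist_model(inputs)
print(out.shape)
```

Note that DataParallel is single-process (one Python process drives all GPUs), which is why the error from init_process_group never arises with it.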

@hfj-cc
Copy link

hfj-cc commented Dec 11, 2023

The code uses only a single process — does that affect training time? Also, what batch_size should be set when training on multiple cards? Training on 4× 1080 Ti with batch_size=96 takes over 50 hours.
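For what it's worth, under nn.DataParallel the batch_size is the global batch: it is split evenly across visible GPUs each step, so the per-card batch is just the quotient. A quick sketch of the arithmetic for the setup described above:

```python
# DataParallel splits each batch along dim 0 across the visible GPUs,
# so batch_size=96 on 4 cards means each GPU processes 96 // 4 samples
# per forward pass.
global_batch = 96
num_gpus = 4
per_gpu = global_batch // num_gpus
print(per_gpu)
```

By contrast, with DistributedDataParallel (one process per GPU), batch_size is per-process, so the effective global batch is batch_size × world_size.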
