fix PLE network bug & sort file list for ps trainer #932
base: master
i is the index of the task, so it must be multiplied by the number of experts per task, not by the number of tasks.
fixed code style
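The file order may differ across nodes, so split_file_list may read the same file on more than one node. Sorting guarantees that every node sees the file list in the same order, so after the split no two nodes read the same file.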
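The reader is changed because, in distributed training, the file order on different nodes may be inconsistent (the listing is unordered on every node), so after split_file_list different nodes may read the same file. Sorting guarantees that every node sees the file list in the same order, so after the split no two nodes read the same file.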
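The effect of sorting before splitting can be illustrated with a minimal sketch. The helper `shard_files` below is a hypothetical stand-in for split_file_list (it assumes a simple strided split, which may not match the real implementation), and the file names are made up for illustration.

```python
def shard_files(file_list, worker_id, worker_num):
    """Hypothetical stand-in for split_file_list.

    Assumption: worker k takes every worker_num-th file starting at index k.
    """
    return file_list[worker_id::worker_num]


# The same four files, listed in a different order on each node
# (directory listings are not guaranteed to be ordered).
node0_listing = ["part-2", "part-0", "part-3", "part-1"]
node1_listing = ["part-0", "part-2", "part-1", "part-3"]

# Without sorting, each node splits its own ordering of the list.
unsorted_0 = shard_files(node0_listing, 0, 2)
unsorted_1 = shard_files(node1_listing, 1, 2)
overlap_unsorted = set(unsorted_0) & set(unsorted_1)

# With sorting, every node splits the identical list.
sorted_0 = shard_files(sorted(node0_listing), 0, 2)
sorted_1 = shard_files(sorted(node1_listing), 1, 2)
overlap_sorted = set(sorted_0) & set(sorted_1)

print("unsorted overlap:", sorted(overlap_unsorted))
print("sorted overlap:", sorted(overlap_sorted))
```

In this sketch the unsorted split hands both nodes the same two files and leaves the other two unread, while the sorted split gives disjoint shards that together cover every file.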
The PR is correct. The task_init and exp_init parts also have a problem.
@@ -179,7 +179,7 @@ def forward(self, input_data):
         # task-specific expert part
         for i in range(0, self.task_num):
             for j in range(0, self.exp_per_task):
-                linear_out = self._param_expert[i * self.task_num + j](
+                linear_out = self._param_expert[i * self.exp_per_task + j](
Correct.
i is the task index, so it must be multiplied by the number of experts per task, not by the number of tasks.
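A small sketch of the indexing makes the bug concrete. It assumes the task-specific experts live in one flat list of length task_num * exp_per_task, laid out task by task; the sizes chosen here (2 tasks, 3 experts each) are illustrative only.

```python
task_num = 2       # illustrative value, not taken from the PR
exp_per_task = 3   # illustrative value, not taken from the PR

# Flat indices visited by the original (buggy) expression.
buggy = [i * task_num + j
         for i in range(task_num)
         for j in range(exp_per_task)]

# Flat indices visited by the corrected expression.
fixed = [i * exp_per_task + j
         for i in range(task_num)
         for j in range(exp_per_task)]

print("buggy:", buggy)  # index 2 is visited twice, index 5 never
print("fixed:", fixed)  # every index 0..5 is visited exactly once
```

With these sizes the buggy stride makes the second task reuse the first task's last expert and skip one expert entirely. When task_num is larger than exp_per_task, the same expression would instead index past the end of the list.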