Problems encountered during model training #39
Comments
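These are the complete parameter settings I used when reproducing the results. Besides the fedprox run shown here, the other methods have the same problem. gen_config = {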
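Hello, and thank you very much for the feedback. I tried running the same command as you, but I could not reproduce your bug. The code I ran was
The training log was: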
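I have run into the same bug before in other scenarios. What happened then was that a NaN value appeared during one backward pass in one client's local training in some round. That NaN then contaminated the global model during the server's aggregation step, which made the loss NaN. This tends to happen when the data distribution is strongly non-IID and imbalanced, and the number of local training steps or the step size (learning rate) is large.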
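The propagation described above can be illustrated with a minimal sketch. It uses plain Python lists in place of model tensors, and the `average` helper and the client values are hypothetical, not taken from flgo. It only shows why a single NaN entry from one client is enough to make the averaged global parameter NaN.

```python
import math


def average(client_weights):
    """Element-wise mean of equally long weight vectors (hypothetical helper)."""
    n = len(client_weights)
    return [sum(column) / n for column in zip(*client_weights)]


# Three made-up client updates; the third has a NaN in its first coordinate,
# as might happen after a diverging backward pass.
clients = [
    [0.10, -0.20, 0.30],
    [0.12, -0.18, 0.28],
    [float("nan"), -0.21, 0.31],
]

global_weights = average(clients)

# Any arithmetic involving NaN yields NaN, so the first coordinate of the
# averaged model is NaN even though two of the three clients were healthy.
nan_positions = [i for i, w in enumerate(global_weights) if math.isnan(w)]
print(nan_positions)
```

Once a coordinate of the global model is NaN, every client that starts the next round from that model inherits it, which is consistent with the loss staying NaN afterwards.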
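Yes, as you said, the data distribution I constructed is indeed non-IID and imbalanced, so it is very likely that a NaN appeared during some backward pass. Could a check be added at aggregation time to eliminate this kind of error?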
Sure. I will add an option to automatically filter out NaNs in the next version.
Has this been updated yet?
Hello, this has been updated in the new version. Specifically, NaN detection has been added in flgo.algorithm.fedbase.BasicServer.aggregate.
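For readers who want to see the idea in isolation, here is a minimal sketch of a NaN check at aggregation time. It is not the code in flgo.algorithm.fedbase.BasicServer.aggregate; the function names `has_nan` and `aggregate_skip_nan`, the uniform averaging, and the fallback behaviour are all assumptions made for illustration, and plain lists stand in for model parameters.

```python
import math


def has_nan(weights):
    """Return True if any entry of the weight vector is NaN."""
    return any(math.isnan(w) for w in weights)


def aggregate_skip_nan(client_weights, fallback):
    """Uniformly average client updates, ignoring any update that contains NaN.

    If every update is corrupted, keep the previous global weights (fallback).
    """
    valid = [ws for ws in client_weights if not has_nan(ws)]
    if not valid:
        return list(fallback)
    n = len(valid)
    return [sum(column) / n for column in zip(*valid)]


previous_global = [0.0, 0.0]
updates = [[1.0, 2.0], [3.0, 4.0], [float("nan"), 0.0]]

# The corrupted third update is dropped; only the first two are averaged.
new_global = aggregate_skip_nan(updates, previous_global)
print(new_global)

# When all updates are corrupted, the previous global model is kept.
all_bad = aggregate_skip_nan([[float("nan"), 1.0]], previous_global)
print(all_bad)
```

Dropping a corrupted update changes the effective set of participants for that round, so a real implementation would probably also want to log which clients were skipped.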
When I run on a remote server, the process often hangs and stops making progress. Do you know what causes this?
Hello, could you share the command and files you ran? I don't think I have come across this problem so far.
It turned out to be a problem with how I was running it. How can I access the proportions from the Dirichlet partition?
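The proportions in DirichletPartitioner is a temporary variable of the dataset partitioning stage and is not saved as an attribute. If you need it, the usual approach is to recompute the proportion during the initialization stage of the federated algorithm. For a concrete example, see this snippet from resources/algorithm/fedrod.py:

    # Imports the snippet relies on
    import collections
    import copy

    import torch
    import flgo.algorithm.fedbase

    class Server(flgo.algorithm.fedbase.BasicServer):
        def initialize(self):
            self.init_algo_para({'lmbd': 0.1, 'num_hidden_layers': 1, 'hidden_dim': 100})
            # Number of classes = number of distinct labels in the test set
            self.num_classes = len(collections.Counter([d[-1] for d in self.test_data]))
            for c in self.clients:
                c.num_classes = self.num_classes
            self.hyper = self.num_hidden_layers > 0
            self.hnet = self.init_hnet().to(self.device) if self.hyper else None

    class Client(flgo.algorithm.fedbase.BasicClient):
        def initialize(self):
            # Count how often each label occurs in this client's training data
            lb_counter = collections.Counter([d[-1] for d in self.train_data])
            self.dist = torch.zeros((1, self.num_classes))
            for k in lb_counter.keys():
                self.dist[0][k] = lb_counter[k]
            # Normalize the counts into a label distribution
            self.dist = self.dist / len(self.train_data)
            self.head = copy.deepcopy(self.server.model.head)
            self.hyper = self.num_hidden_layers > 0

Here, dist is equivalent to proportion.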
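The counting step in the client above can also be tried outside of flgo. The following is a minimal standalone sketch using only the standard library; the function name `label_distribution` and the example labels are hypothetical, and a plain list replaces the torch tensor.

```python
from collections import Counter


def label_distribution(labels, num_classes):
    """Return the fraction of samples per class, as a list of length num_classes."""
    total = len(labels)
    if total == 0:
        # No local data: return an all-zero distribution instead of dividing by zero.
        return [0.0] * num_classes
    counts = Counter(labels)
    return [counts.get(k, 0) / total for k in range(num_classes)]


# Made-up labels for one client: class 0 dominates and class 3 is absent.
client_labels = [0, 0, 0, 1, 2, 2]
dist = label_distribution(client_labels, num_classes=4)
print(dist)
```

Classes that a client never sees get a proportion of zero, which is exactly the kind of skew a Dirichlet partition with a small concentration parameter tends to produce.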
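Thank you very much for your work. However, while reproducing your code I often run into the following problem: during training, the accuracy suddenly drops to 0 and the loss tends toward NaN. I would like to know what causes this.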
{"option": {"sample": "md", "aggregate": "uniform", "num_rounds": 100, "proportion": 0.6, "learning_rate_decay": 0.998, "lr_scheduler": -1, "early_stop": -1, "num_epochs": 2, "num_steps": -1, "learning_rate": 0.1, "batch_size": 64.0, "optimizer": "SGD", "clip_grad": 0.0, "momentum": 0.0, "weight_decay": 0.0, "num_edge_rounds": 5, "algo_para": [], "train_holdout": 0.1, "test_holdout": 0.0, "local_test": false, "seed": 0, "gpu": [0], "server_with_cpu": false, "num_parallels": 1, "num_workers": 0, "pin_memory": false, "test_batch_size": 512, "availability": "IDL", "connectivity": "IDL", "completeness": "IDL", "responsiveness": "IDL", "log_level": "INFO", "log_file": true, "no_log_console": false, "no_overwrite": false, "eval_interval": 1, "task": "./my_task", "algorithm": "fedprox", "model": "cnn"}