
Problem during model training #39

Open
timohaha opened this issue Aug 15, 2023 · 11 comments
timohaha commented Aug 15, 2023

Thank you very much for your work. However, while reproducing your code I often run into the following problem: during training, the accuracy suddenly drops to 0 and the loss becomes NaN. I would like to know what causes this. My configuration and the relevant part of the log are below.
{"option": {"sample": "md", "aggregate": "uniform", "num_rounds": 100, "proportion": 0.6, "learning_rate_decay": 0.998, "lr_scheduler": -1, "early_stop": -1, "num_epochs": 2, "num_steps": -1, "learning_rate": 0.1, "batch_size": 64.0, "optimizer": "SGD", "clip_grad": 0.0, "momentum": 0.0, "weight_decay": 0.0, "num_edge_rounds": 5, "algo_para": [], "train_holdout": 0.1, "test_holdout": 0.0, "local_test": false, "seed": 0, "gpu": [0], "server_with_cpu": false, "num_parallels": 1, "num_workers": 0, "pin_memory": false, "test_batch_size": 512, "availability": "IDL", "connectivity": "IDL", "completeness": "IDL", "responsiveness": "IDL", "log_level": "INFO", "log_file": true, "no_log_console": false, "no_overwrite": false, "eval_interval": 1, "task": "./my_task", "algorithm": "fedprox", "model": "cnn"}
`

  • 2023-08-11 12:33:58,111 fedbase.py run [line:246] INFO --------------Round 80--------------

  • 2023-08-11 12:33:58,111 simple_logger.py log_once [line:14] INFO Current_time:80

  • 2023-08-11 12:34:01,120 simple_logger.py log_once [line:28] INFO test_accuracy 0.5547

  • 2023-08-11 12:34:01,120 simple_logger.py log_once [line:28] INFO test_loss 1.3093

  • 2023-08-11 12:34:01,120 simple_logger.py log_once [line:28] INFO val_accuracy 0.5546

  • 2023-08-11 12:34:01,121 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.5612

  • 2023-08-11 12:34:01,121 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.1447

  • 2023-08-11 12:34:01,121 simple_logger.py log_once [line:28] INFO val_loss 1.3208

  • 2023-08-11 12:34:01,121 simple_logger.py log_once [line:28] INFO mean_val_loss 1.2945

  • 2023-08-11 12:34:01,121 simple_logger.py log_once [line:28] INFO std_val_loss 0.4200

  • 2023-08-11 12:34:01,121 fedbase.py run [line:251] INFO Eval Time Cost: 3.0099s

  • 2023-08-11 12:34:14,004 fedbase.py run [line:246] INFO --------------Round 81--------------

  • 2023-08-11 12:34:14,004 simple_logger.py log_once [line:14] INFO Current_time:81

  • 2023-08-11 12:34:17,050 simple_logger.py log_once [line:28] INFO test_accuracy 0.5408

  • 2023-08-11 12:34:17,050 simple_logger.py log_once [line:28] INFO test_loss 1.4533

  • 2023-08-11 12:34:17,050 simple_logger.py log_once [line:28] INFO val_accuracy 0.5417

  • 2023-08-11 12:34:17,050 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.5378

  • 2023-08-11 12:34:17,050 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.1402

  • 2023-08-11 12:34:17,051 simple_logger.py log_once [line:28] INFO val_loss 1.4877

  • 2023-08-11 12:34:17,051 simple_logger.py log_once [line:28] INFO mean_val_loss 1.4758

  • 2023-08-11 12:34:17,051 simple_logger.py log_once [line:28] INFO std_val_loss 0.5131

  • 2023-08-11 12:34:17,051 fedbase.py run [line:251] INFO Eval Time Cost: 3.0476s

  • 2023-08-11 12:34:29,085 fedbase.py run [line:246] INFO --------------Round 82--------------

  • 2023-08-11 12:34:29,085 simple_logger.py log_once [line:14] INFO Current_time:82

  • 2023-08-11 12:34:32,096 simple_logger.py log_once [line:28] INFO test_accuracy 0.5482

  • 2023-08-11 12:34:32,096 simple_logger.py log_once [line:28] INFO test_loss 1.3859

  • 2023-08-11 12:34:32,096 simple_logger.py log_once [line:28] INFO val_accuracy 0.5411

  • 2023-08-11 12:34:32,096 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.5413

  • 2023-08-11 12:34:32,097 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.1292

  • 2023-08-11 12:34:32,097 simple_logger.py log_once [line:28] INFO val_loss 1.4131

  • 2023-08-11 12:34:32,097 simple_logger.py log_once [line:28] INFO mean_val_loss 1.4190

  • 2023-08-11 12:34:32,097 simple_logger.py log_once [line:28] INFO std_val_loss 0.4560

  • 2023-08-11 12:34:32,097 fedbase.py run [line:251] INFO Eval Time Cost: 3.0121s

  • 2023-08-11 12:34:44,857 fedbase.py run [line:246] INFO --------------Round 83--------------

  • 2023-08-11 12:34:44,858 simple_logger.py log_once [line:14] INFO Current_time:83

  • 2023-08-11 12:34:47,908 simple_logger.py log_once [line:28] INFO test_accuracy 0.1000

  • 2023-08-11 12:34:47,908 simple_logger.py log_once [line:28] INFO test_loss nan

  • 2023-08-11 12:34:47,909 simple_logger.py log_once [line:28] INFO val_accuracy 0.0996

  • 2023-08-11 12:34:47,909 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.0760

  • 2023-08-11 12:34:47,909 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.2083

  • 2023-08-11 12:34:47,909 simple_logger.py log_once [line:28] INFO val_loss nan

  • 2023-08-11 12:34:47,909 simple_logger.py log_once [line:28] INFO mean_val_loss nan

  • 2023-08-11 12:34:47,909 simple_logger.py log_once [line:28] INFO std_val_loss nan

  • 2023-08-11 12:34:47,909 fedbase.py run [line:251] INFO Eval Time Cost: 3.0507s

  • 2023-08-11 12:35:00,954 fedbase.py run [line:246] INFO --------------Round 84--------------

  • 2023-08-11 12:35:00,954 simple_logger.py log_once [line:14] INFO Current_time:84

  • 2023-08-11 12:35:03,999 simple_logger.py log_once [line:28] INFO test_accuracy 0.1000

  • 2023-08-11 12:35:03,999 simple_logger.py log_once [line:28] INFO test_loss nan

  • 2023-08-11 12:35:04,000 simple_logger.py log_once [line:28] INFO val_accuracy 0.0996

  • 2023-08-11 12:35:04,000 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.0760

  • 2023-08-11 12:35:04,000 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.2083

  • 2023-08-11 12:35:04,000 simple_logger.py log_once [line:28] INFO val_loss nan

  • 2023-08-11 12:35:04,000 simple_logger.py log_once [line:28] INFO mean_val_loss nan

  • 2023-08-11 12:35:04,000 simple_logger.py log_once [line:28] INFO std_val_loss nan

  • 2023-08-11 12:35:04,000 fedbase.py run [line:251] INFO Eval Time Cost: 3.0459s

  • 2023-08-11 12:35:14,175 fedbase.py run [line:246] INFO --------------Round 85--------------

  • 2023-08-11 12:35:14,175 simple_logger.py log_once [line:14] INFO Current_time:85

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO test_accuracy 0.1000

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO test_loss nan

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO val_accuracy 0.0996

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.0760

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.2083

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO val_loss nan

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO mean_val_loss nan

  • 2023-08-11 12:35:17,211 simple_logger.py log_once [line:28] INFO std_val_loss nan

  • 2023-08-11 12:35:17,211 fedbase.py run [line:251] INFO Eval Time Cost: 3.0360s

  • 2023-08-11 12:35:31,371 fedbase.py run [line:246] INFO --------------Round 86--------------

  • 2023-08-11 12:35:31,371 simple_logger.py log_once [line:14] INFO Current_time:86

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO test_accuracy 0.1000

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO test_loss nan

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO val_accuracy 0.0996

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO mean_val_accuracy 0.0760

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO std_val_accuracy 0.2083

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO val_loss nan

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO mean_val_loss nan

  • 2023-08-11 12:35:34,401 simple_logger.py log_once [line:28] INFO std_val_loss nan

  • 2023-08-11 12:35:34,401 fedbase.py run [line:251] INFO Eval Time Cost: 3.0308s

`


timohaha commented Aug 15, 2023

These are the complete parameter settings I used for reproduction. Besides the fedprox run shown above, the other methods have the same problem.

```json
{"option": {"sample": "md", "aggregate": "uniform", "num_rounds": 100, "proportion": 0.6, "learning_rate_decay": 0.998, "lr_scheduler": -1, "early_stop": -1, "num_epochs": 2, "num_steps": -1, "learning_rate": 0.1, "batch_size": 64.0, "optimizer": "SGD", "clip_grad": 0.0, "momentum": 0.0, "weight_decay": 0.0, "num_edge_rounds": 5, "algo_para": [], "train_holdout": 0.1, "test_holdout": 0.0, "local_test": false, "seed": 0, "gpu": [0], "server_with_cpu": false, "num_parallels": 1, "num_workers": 0, "pin_memory": false, "test_batch_size": 512, "availability": "IDL", "connectivity": "IDL", "completeness": "IDL", "responsiveness": "IDL", "log_level": "INFO", "log_file": true, "no_log_console": false, "no_overwrite": false, "eval_interval": 1, "task": "./my_task", "algorithm": "fedprox", "model": "cnn"}
```

```python
gen_config = {
    'benchmark': {'name': 'flgo.benchmark.cifar10_classification'},
    # 'partitioner': {'name': 'IIDPartitioner', 'para': {'num_clients': 100}},
    'partitioner': {'name': 'DirichletPartitioner', 'para': {'num_clients': 20, 'alpha': 0.5, 'imbalance': 0.5}},
}
```


WwZzz commented Aug 15, 2023

Hi, thank you very much for the feedback. I tried running the same command as yours but could not reproduce the bug. The code I ran is:

```python
import os
import flgo
import flgo.algorithm.fedprox as fedprox
from flgo.benchmark.cifar10_classification.model import cnn

gen_config = {
    'benchmark': {'name': 'flgo.benchmark.cifar10_classification'},
    # 'partitioner': {'name': 'IIDPartitioner', 'para': {'num_clients': 100}},
    'partitioner': {'name': 'DirichletPartitioner', 'para': {'num_clients': 20, 'alpha': 0.5, 'imbalance': 0.5}},
}

op = {"sample": "md", "aggregate": "uniform", "num_rounds": 100, "proportion": 0.6,
      "learning_rate_decay": 0.998, "lr_scheduler": -1, "early_stop": -1, "num_epochs": 2,
      "num_steps": -1, "learning_rate": 0.1, "batch_size": 64.0, "optimizer": "SGD",
      "clip_grad": 0.0, "momentum": 0.0, "weight_decay": 0.0, "num_edge_rounds": 5,
      "algo_para": [], "train_holdout": 0.1, "test_holdout": 0.0, "seed": 0, "gpu": [0],
      "pin_memory": True, "test_batch_size": 512}

task = "./my_cifar10"
if not os.path.exists(task):
    flgo.gen_task(gen_config, task)

runner = flgo.init(task, fedprox, op, model=cnn)
runner.run()
```

The training log is:

2023-08-15 11:43:14,556 fedbase.py run [line:246] INFO --------------Round 99--------------
2023-08-15 11:43:14,556 simple_logger.py log_once [line:14] INFO Current_time:99
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO test_accuracy                 0.5831
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO test_loss                     1.1739
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO val_accuracy                  0.5784
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO mean_val_accuracy             0.5945
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO std_val_accuracy              0.1658
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO val_loss                      1.1702
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO mean_val_loss                 1.1268
2023-08-15 11:43:17,089 simple_logger.py log_once [line:28] INFO std_val_loss                  0.3629
2023-08-15 11:43:17,089 fedbase.py run [line:251] INFO Eval Time Cost:               2.5332s
2023-08-15 11:43:28,403 fedbase.py run [line:246] INFO --------------Round 100--------------
2023-08-15 11:43:28,403 simple_logger.py log_once [line:14] INFO Current_time:100
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO test_accuracy                 0.5603
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO test_loss                     1.2288
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO val_accuracy                  0.5537
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO mean_val_accuracy             0.5742
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO std_val_accuracy              0.1743
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO val_loss                      1.2375
2023-08-15 11:43:30,808 simple_logger.py log_once [line:28] INFO mean_val_loss                 1.1856
2023-08-15 11:43:30,809 simple_logger.py log_once [line:28] INFO std_val_loss                  0.3935
2023-08-15 11:43:30,809 fedbase.py run [line:251] INFO Eval Time Cost:               2.4054s
2023-08-15 11:43:30,809 fedbase.py run [line:257] INFO =================End==================
2023-08-15 11:43:30,809 fedbase.py run [line:258] INFO Total Time Cost:              1244.4753s


WwZzz commented Aug 15, 2023

> These are the complete parameter settings I used for reproduction. Besides the fedprox run shown above, the other methods have the same problem.

I have run into the same bug before in other scenarios. What happened was that, in some round, a NaN value appeared during a backward pass in one client's local training, and that NaN then contaminated the global model during the server's aggregation step, so the loss became NaN. This tends to occur when the data partition is fairly non-IID and imbalanced and the local training uses a large number of steps or a large step size.
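For illustration only, a minimal sketch (not FLGo's actual code; the helper name is hypothetical) of how such a corrupted local model can be detected with plain PyTorch:

```python
import torch

def has_bad_param(model: torch.nn.Module) -> bool:
    """Return True if any parameter contains NaN or Inf (hypothetical helper)."""
    return any(not torch.isfinite(p).all() for p in model.parameters())

# Sketch of where such a check could sit: after a client's local training,
# a corrupted model would simply be dropped instead of being aggregated.
# if has_bad_param(local_model):
#     continue  # skip this client's update for the current round
```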

@timohaha (Author)

Yes, as you said, the data distribution I constructed is indeed non-IID and imbalanced, which very likely causes a NaN value to appear in some backward pass. Would it be possible to add a check during aggregation to eliminate this kind of error?


WwZzz commented Aug 15, 2023

> Would it be possible to add a check during aggregation to eliminate this kind of error?

Sure. I will add an option in the next version to automatically remove NaN.

@Lyy838354973

> Sure. I will add an option in the next version to automatically remove NaN.

Has this been updated yet?


WwZzz commented Sep 4, 2023

> Has this been updated yet?

Hi, this has been updated in the new version. Specifically, a NaN check was added in flgo.algorithm.fedbase.BasicServer.aggregate.
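For reference, a minimal sketch of what NaN filtering before weighted averaging can look like; this is an illustration only, not the actual code in BasicServer.aggregate, and the function names are hypothetical:

```python
import torch

def is_finite_model(model: torch.nn.Module) -> bool:
    # A model is usable only if every parameter is free of NaN/Inf.
    return all(torch.isfinite(p).all() for p in model.parameters())

def aggregate_with_nan_filter(client_models, client_weights):
    """Weighted-average client state_dicts, skipping corrupted updates (sketch)."""
    kept = [(m, w) for m, w in zip(client_models, client_weights) if is_finite_model(m)]
    if not kept:
        return None  # nothing usable this round; the caller keeps the previous global model
    models, weights = zip(*kept)
    total = float(sum(weights))
    new_state = {}
    for key in models[0].state_dict():
        # Note: integer buffers (e.g. BatchNorm counters) would need separate handling in practice.
        new_state[key] = sum(m.state_dict()[key] * (w / total) for m, w in zip(models, weights))
    return new_state
```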

@Lyy838354973

> Hi, this has been updated in the new version. Specifically, a NaN check was added in flgo.algorithm.fedbase.BasicServer.aggregate.

When I run it remotely on a server, the process often hangs and stops making progress. Do you know anything about this issue?


WwZzz commented Sep 5, 2023

> When I run it remotely on a server, the process often hangs and stops making progress. Do you know anything about this issue?

Hi, could you share the command and the files you ran? I haven't run into this problem so far.

@Lyy838354973

> Hi, could you share the command and the files you ran? I haven't run into this problem so far.

It was a problem on my end. How can I access the proportions used by the Dirichlet partition?


WwZzz commented Sep 7, 2023

> It was a problem on my end. How can I access the proportions used by the Dirichlet partition?

The proportions in DirichletPartitioner is a temporary variable used during the dataset partitioning stage and is not saved as an attribute. If you need it, the usual approach is to recompute the proportion during the initialization phase of the federated algorithm. For a concrete example, see the following snippet from resources/algorithm/fedrod.py:

```python
import collections
import copy

import torch
import flgo.algorithm.fedbase

class Server(flgo.algorithm.fedbase.BasicServer):
    def initialize(self):
        self.init_algo_para({'lmbd': 0.1, 'num_hidden_layers': 1, 'hidden_dim': 100})
        # Count the number of classes from the server-side test data.
        self.num_classes = len(collections.Counter([d[-1] for d in self.test_data]))
        for c in self.clients: c.num_classes = self.num_classes
        self.hyper = self.num_hidden_layers > 0
        self.hnet = self.init_hnet().to(self.device) if self.hyper else None

class Client(flgo.algorithm.fedbase.BasicClient):
    def initialize(self):
        # Compute this client's label distribution from its local training data.
        lb_counter = collections.Counter([d[-1] for d in self.train_data])
        self.dist = torch.zeros((1, self.num_classes))
        for k in lb_counter.keys():
            self.dist[0][k] = lb_counter[k]
        self.dist = self.dist / len(self.train_data)
        self.head = copy.deepcopy(self.server.model.head)
        self.hyper = self.num_hidden_layers > 0
```

Here, dist is equivalent to proportion.
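If only the proportion vector itself is needed, the same counting logic can be factored into a small standalone helper; the function below is a generic sketch (its name and the usage line are illustrative, not part of FLGo's API):

```python
import collections
import torch

def label_proportion(dataset, num_classes):
    """Fraction of samples per class, for (x, y)-style samples (illustrative helper)."""
    counter = collections.Counter([d[-1] for d in dataset])
    dist = torch.zeros(num_classes)
    for label, count in counter.items():
        dist[label] = count
    return dist / max(len(dataset), 1)

# Hypothetical usage inside a client-side initialize(), mirroring fedrod's Client:
# self.dist = label_proportion(self.train_data, self.num_classes)
```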
