NanLossDuringTrainingError: NaN loss during training #105

Open
Wuxinxiaoshifu opened this issue Nov 3, 2021 · 0 comments

INFO:tensorflow:Using config: {'_model_dir': '/home/yzh/v3plus/tensorflow-deeplab-v3-plus/dataset/test2/model/new', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 1000000000.0, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff57ace0c88>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Start training.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2021-11-02 23:17:44.611893: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-02 23:17:44.828950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:02:00.0
totalMemory: 23.70GiB freeMemory: 23.44GiB
2021-11-02 23:17:44.967547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:81:00.0
totalMemory: 23.69GiB freeMemory: 23.27GiB
2021-11-02 23:17:44.967602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2021-11-02 23:21:49.062821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-02 23:21:49.062859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2021-11-02 23:21:49.062867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2021-11-02 23:21:49.062871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2021-11-02 23:21:49.063044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22724 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6)
2021-11-02 23:21:49.063425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22555 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:81:00.0, compute capability: 8.6)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /home/yzh/v3plus/tensorflow-deeplab-v3-plus/dataset/test2/model/new/model.ckpt.
INFO:tensorflow:cross_entropy = 1.9338539, learning_rate = 0.007, train_mean_iou = 0.014417753, train_px_accuracy = 0.086506516
INFO:tensorflow:loss = 24.278753, step = 0
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 267, in main
    hooks=train_hooks,
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
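
For anyone triaging this log, here is a minimal sketch that pulls out the few numbers it actually reports: the pause during GPU initialization, the step-0 metrics, and whether the run diverged. It uses only the Python standard library. The function name `summarize` and the key names in its result are hypothetical, chosen for illustration, and the `LOG` excerpt is copied from the lines above. What the numbers mean is an assumption, not something the log confirms: the part of `loss` that is not `cross_entropy` is presumably a regularization term, and a startup pause of several minutes on a compute capability 8.6 card might point to a CUDA build that predates that architecture.

```python
import re
from datetime import datetime

# Excerpt copied from the log in this issue.
LOG = """\
2021-11-02 23:17:44.967602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2021-11-02 23:21:49.062821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
INFO:tensorflow:cross_entropy = 1.9338539, learning_rate = 0.007, train_mean_iou = 0.014417753, train_px_accuracy = 0.086506516
INFO:tensorflow:loss = 24.278753, step = 0
ERROR:tensorflow:Model diverged with loss = NaN.
"""

# Timestamp prefix used by the TensorFlow C++ runtime lines.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")
# "name = number" pairs as printed by the logging hook.
METRIC = re.compile(r"(\w+) = ([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def summarize(log_text):
    """Collect timestamp gaps, logged metrics, and the divergence flag."""
    stamps = []
    metrics = {}
    diverged = False
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )
        if line.startswith("INFO:tensorflow:"):
            for name, value in METRIC.findall(line):
                metrics[name] = float(value)
        if line.startswith("ERROR:tensorflow:Model diverged"):
            diverged = True

    # Seconds between consecutive timestamped lines; 0.0 if fewer than two.
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return {
        "max_gap_seconds": max(gaps) if gaps else 0.0,
        "metrics": metrics,
        # Assumption: whatever is left after cross_entropy is regularization.
        "non_ce_loss": metrics["loss"] - metrics["cross_entropy"],
        "diverged": diverged,
    }


summary = summarize(LOG)
print(summary)
```

On this excerpt the largest gap is a little over four minutes (23:17:44 to 23:21:49), and the loss at step 0 is already about 22.3 above the reported cross-entropy before the first update produces NaN. Neither figure proves a cause, but both are worth including when reporting the problem, together with the TensorFlow and CUDA versions in the environment.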
