Bug: Autokeras/TF fails with CUDA_ERROR_ILLEGAL_ADDRESS/CUDA_ERROR_INVALID_HANDLE when max_trials is not 1 #1916

Open
billytcl opened this issue Apr 24, 2024 · 0 comments


Bug Description

When I use ImageClassifier in AutoKeras 2.0.0 with max_trials set to anything other than 1, the whole pipeline crashes with CUDA_ERROR_ILLEGAL_ADDRESS and CUDA_ERROR_INVALID_HANDLE errors. The first trial completes without problems, but the crash happens as soon as the second trial starts.

Bug Reproduction

Code for reproducing the bug:

It's just boilerplate image classifier code:

import autokeras as ak

# data_dir and args are defined earlier in the script (argument parsing, not shown).
batch_size = 10
img_height = 99674  # was 29303965
img_width = 6

train_data = ak.image_dataset_from_directory(
    data_dir,
    # Use 20% of the data as testing data.
    validation_split=0.2,
    subset="training",
    # Set the seed so the same split is used when loading the testing data.
    seed=0,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

test_data = ak.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=0,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

# max_trials is not set here, so the AutoKeras default (100) applies.
clf = ak.ImageClassifier(
    num_classes=2,
    loss="auc",
    directory=args.out_model,
    seed=0,
)

clf.fit(x=train_data, validation_data=test_data)
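
For comparison, the two configurations described in this report can be written side by side. This is only a sketch: `classifier_kwargs` and `build_classifier` are hypothetical helper names, not part of AutoKeras, and `out_dir` stands in for `args.out_model`. Per the description, passing `max_trials=1` is the case that completes, and leaving it unset is the case that crashes at trial 2.

```python
def classifier_kwargs(out_dir, max_trials=None):
    """Build the keyword arguments passed to ak.ImageClassifier in this report.

    Hypothetical helper: when max_trials is None the argument is omitted,
    so AutoKeras uses its own default (the crashing case reported here).
    """
    kwargs = {
        "num_classes": 2,
        "loss": "auc",
        "directory": out_dir,
        "seed": 0,
    }
    if max_trials is not None:
        kwargs["max_trials"] = max_trials
    return kwargs


def build_classifier(out_dir, max_trials=None):
    # Imported lazily so classifier_kwargs can be inspected without the GPU stack.
    import autokeras as ak

    return ak.ImageClassifier(**classifier_kwargs(out_dir, max_trials))
```

Calling `build_classifier(out_dir, max_trials=1)` would correspond to the working run, and `build_classifier(out_dir)` to the failing one.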

Output from training:

2024-04-24 14:40:09.349858: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-24 14:40:12.967614: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Namespace(files='/scratch/groups/hanleeji/CREST_images/chunk_100k/chunk001/', out_model='/scratch/groups/hanleeji/CREST_images/models/chunk001_ak/')
2024-04-24 14:40:20.850001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38367 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-04-24 14:40:24.193006: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-04-24 14:40:31.741614: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
Found 833 files belonging to 2 classes.
Using 667 files for training.
Found 833 files belonging to 2 classes.
Using 166 files for validation.

Search: Running Trial #1

Value             |Best Value So Far |Hyperparameter
vanilla           |vanilla           |image_block_1/block_type
True              |True              |image_block_1/normalize
False             |False             |image_block_1/augment
3                 |3                 |image_block_1/conv_block_1/kernel_size
1                 |1                 |image_block_1/conv_block_1/num_blocks
2                 |2                 |image_block_1/conv_block_1/num_layers
True              |True              |image_block_1/conv_block_1/max_pooling
False             |False             |image_block_1/conv_block_1/separable
0.25              |0.25              |image_block_1/conv_block_1/dropout
32                |32                |image_block_1/conv_block_1/filters_0_0
64                |64                |image_block_1/conv_block_1/filters_0_1
flatten           |flatten           |classification_head_1/spatial_reduction_1/reduction_type
0.5               |0.5               |classification_head_1/dropout
adam              |adam              |optimizer
0.001             |0.001             |learning_rate

Epoch 1/1000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713994833.007195   84052 service.cc:145] XLA service 0x7f1b04003600 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713994833.007327   84052 service.cc:153]   StreamExecutor device (0): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2024-04-24 14:40:33.063821: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-24 14:40:33.754554: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907
I0000 00:00:1713994847.238606   84052 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
67/67 ━━━━━━━━━━━━━━━━━━━━ 29s 209ms/step - accuracy: 0.5139 - loss: 7.1502 - val_accuracy: 0.6145 - val_loss: 0.6860
Epoch 2/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 91ms/step - accuracy: 0.9425 - loss: 0.2806 - val_accuracy: 0.5964 - val_loss: 0.7732
Epoch 3/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 90ms/step - accuracy: 1.0000 - loss: 0.0206 - val_accuracy: 0.6024 - val_loss: 0.9873
Epoch 4/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 88ms/step - accuracy: 1.0000 - loss: 0.0035 - val_accuracy: 0.6446 - val_loss: 1.6338
Epoch 5/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 94ms/step - accuracy: 1.0000 - loss: 0.0016 - val_accuracy: 0.6386 - val_loss: 1.4921
Epoch 6/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 6.3239e-04 - val_accuracy: 0.6446 - val_loss: 1.6766
Epoch 7/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 4.7770e-04 - val_accuracy: 0.6446 - val_loss: 1.7681
Epoch 8/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 2.8391e-04 - val_accuracy: 0.6446 - val_loss: 1.8877
Epoch 9/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 2.8112e-04 - val_accuracy: 0.6386 - val_loss: 1.9450
Epoch 10/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.9816e-04 - val_accuracy: 0.6386 - val_loss: 2.0117
Epoch 11/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.6838e-04 - val_accuracy: 0.6386 - val_loss: 2.0466
2024-04-24 14:42:12.680198: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

Trial 1 Complete [00h 01m 41s]
val_loss: 0.6860103011131287

Best val_loss So Far: 0.6860103011131287
Total elapsed time: 00h 01m 41s

Search: Running Trial #2

Value             |Best Value So Far |Hyperparameter
resnet            |vanilla           |image_block_1/block_type
True              |True              |image_block_1/normalize
True              |False             |image_block_1/augment
True              |None              |image_block_1/image_augmentation_1/horizontal_flip
True              |None              |image_block_1/image_augmentation_1/vertical_flip
0                 |None              |image_block_1/image_augmentation_1/contrast_factor
0                 |None              |image_block_1/image_augmentation_1/rotation_factor
0.1               |None              |image_block_1/image_augmentation_1/translation_factor
0                 |None              |image_block_1/image_augmentation_1/zoom_factor
False             |None              |image_block_1/res_net_block_1/pretrained
resnet50          |None              |image_block_1/res_net_block_1/version
True              |None              |image_block_1/res_net_block_1/imagenet_size
global_avg        |flatten           |classification_head_1/spatial_reduction_1/reduction_type
0                 |0.5               |classification_head_1/dropout
adam              |adam              |optimizer
0.001             |0.001             |learning_rate

Epoch 1/1000
2024-04-24 14:42:37.904335: W tensorflow/core/kernels/gpu_utils.cc:68] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2024-04-24 14:42:41.385410: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'

2024-04-24 14:42:41.385517: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'

2024-04-24 14:42:41.385535: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2024-04-24 14:42:41.385565: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
         [[{{function_node __inference_one_step_on_data_59919}}{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]]
2024-04-24 14:42:41.385615: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
Traceback (most recent call last):
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial
    self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial
    results = self.run_trial(trial, *fit_args, **fit_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial
    obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model
    _, history = utils.fit_with_adaptive_batch_size(
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size
    history = run_with_adaptive_batch_size(
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size
    history = func(x=x, validation_data=validation_data, **fit_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
    batch_size, lambda **kwargs: model.fit(**kwargs), **fit_kwargs
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference defined at (most recent call last):
  File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 67, in <module>

  File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 59, in main

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/tasks/image.py", line 168, in fit

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/auto_model.py", line 303, in fit

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 202, in search

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 234, in search

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 889, in _call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 696, in _initialize

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 339, in converted_call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3263, in call_for_each_replica

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 4061, in _call_for_each_replica

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 906, in _call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 132, in call_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 331, in converted_call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 104, in one_step_on_data

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 51, in train_step

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 224, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 289, in _moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/nn.py", line 1726, in moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 724, in moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 767, in _compute_moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1315, in moments_v2

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1267, in moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/gen_math_ops.py", line 12174, in squared_difference

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 670, in _create_op_internal

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2682, in _create_op_internal

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1177, in from_node_def

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1043, in _create_c_op

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/tf_stack.py", line 162, in extract_stack

'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
         [[{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]] [Op:__inference_one_step_on_iterator_61440]
2024-04-24 14:42:43.435179: E external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:819] failed to record completion event; therefore, failed to create inter-stream dependency
2024-04-24 14:42:43.435283: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:2025] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7f10c2a8d600; host src: 0x7f1a35200000; size: 8=0x8
2024-04-24 14:42:43.435301: E external/local_xla/xla/stream_executor/stream.cc:331] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-04-24 14:42:43.435315: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:30] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-04-24 14:42:43.435327: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
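
The log above mixes two CUDA error codes, and the order is worth noting when reading it: the first failure reported is `cuModuleLoadData` with `CUDA_ERROR_ILLEGAL_ADDRESS`, followed by `CUDA_ERROR_INVALID_HANDLE` from the `cuModuleGetFunction` and `cuLaunchKernel` calls logged after it. A small helper along these lines could pull the error names out of a saved log in order. It is a sketch: `cuda_errors_in_order` and `first_cuda_error` are hypothetical names, and the regular expression only assumes that error names have the form `CUDA_ERROR_*`.

```python
import re

# Matches CUDA driver error names such as CUDA_ERROR_ILLEGAL_ADDRESS.
CUDA_ERROR_RE = re.compile(r"CUDA_ERROR_[A-Z_]+")


def cuda_errors_in_order(log_text):
    """Return (line_number, error_name) pairs in the order they appear."""
    hits = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        for match in CUDA_ERROR_RE.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


def first_cuda_error(log_text):
    """Return the first CUDA error name in the log, or None if there is none."""
    hits = cuda_errors_in_order(log_text)
    return hits[0][1] if hits else None
```

Run against the training output saved to a file, `first_cuda_error(open(path).read())` would report which error code was logged first, where `path` is whatever file the output was redirected to.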

Expected Behavior

I expect the search to continue through the remaining trials, since trial 1 completed successfully.

Setup Details

Include the details about the versions of:

  • OS type and version: Linux
  • Python: 3.10
  • autokeras: 2.0.0
  • keras-tuner: 1.4.7
  • scikit-learn:
  • numpy: 1.26.5
  • pandas:
  • tensorflow: 2.16.1
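
The version list above can be collected from the environment directly, which also covers entries that were left blank. The following is a minimal sketch using only the standard library; `installed_versions` and `PACKAGES` are hypothetical names, and the package names are simply the ones listed in the template.

```python
from importlib import metadata
import platform

# Distribution names as listed in the issue template.
PACKAGES = ("autokeras", "keras-tuner", "scikit-learn", "numpy", "pandas", "tensorflow")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"Python: {platform.python_version()}")
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script inside the same virtual environment as the training job prints one line per package, ready to paste into the list.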

Additional context
