Bug Description
When I use ImageClassifier in AutoKeras 2.0.0, the whole pipeline crashes with CUDA_ERROR_ILLEGAL_ADDRESS and CUDA_ERROR_INVALID_HANDLE errors unless I set max_trials = 1. The first trial completes fine, but the crash happens as soon as the second trial starts.
Bug Reproduction
Code for reproducing the bug:
It's just boilerplate image classifier code:
import autokeras as ak

# data_dir and args are defined earlier in the full script (not shown here).
batch_size = 10
img_height = 99674 #was 29303965
img_width = 6

train_data = ak.image_dataset_from_directory(
    data_dir,
    # Use 20% data as testing data.
    validation_split=0.2,
    subset="training",
    # Set seed to ensure the same split when loading testing data.
    seed=0,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

test_data = ak.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=0,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

# max_trials is left at its default here, which is the case that crashes.
clf = ak.ImageClassifier(
    num_classes=2,
    loss="auc",
    directory=args.out_model,
    seed=0,
)
clf.fit(x=train_data, validation_data=test_data)
Output from training:
2024-04-24 14:40:09.349858: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-24 14:40:12.967614: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Namespace(files='/scratch/groups/hanleeji/CREST_images/chunk_100k/chunk001/', out_model='/scratch/groups/hanleeji/CREST_images/models/chunk001_ak/')
2024-04-24 14:40:20.850001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38367 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-04-24 14:40:24.193006: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-04-24 14:40:31.741614: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
Found 833 files belonging to 2 classes.
Using 667 files for training.
Found 833 files belonging to 2 classes.
Using 166 files for validation.
Search: Running Trial #1
Value |Best Value So Far |Hyperparameter
vanilla |vanilla |image_block_1/block_type
True |True |image_block_1/normalize
False |False |image_block_1/augment
3 |3 |image_block_1/conv_block_1/kernel_size
1 |1 |image_block_1/conv_block_1/num_blocks
2 |2 |image_block_1/conv_block_1/num_layers
True |True |image_block_1/conv_block_1/max_pooling
False |False |image_block_1/conv_block_1/separable
0.25 |0.25 |image_block_1/conv_block_1/dropout
32 |32 |image_block_1/conv_block_1/filters_0_0
64 |64 |image_block_1/conv_block_1/filters_0_1
flatten |flatten |classification_head_1/spatial_reduction_1/reduction_type
0.5 |0.5 |classification_head_1/dropout
adam |adam |optimizer
0.001 |0.001 |learning_rate
Epoch 1/1000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713994833.007195 84052 service.cc:145] XLA service 0x7f1b04003600 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713994833.007327 84052 service.cc:153] StreamExecutor device (0): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2024-04-24 14:40:33.063821: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-24 14:40:33.754554: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907
I0000 00:00:1713994847.238606 84052 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
67/67 ━━━━━━━━━━━━━━━━━━━━ 29s 209ms/step - accuracy: 0.5139 - loss: 7.1502 - val_accuracy: 0.6145 - val_loss: 0.6860
Epoch 2/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 91ms/step - accuracy: 0.9425 - loss: 0.2806 - val_accuracy: 0.5964 - val_loss: 0.7732
Epoch 3/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 90ms/step - accuracy: 1.0000 - loss: 0.0206 - val_accuracy: 0.6024 - val_loss: 0.9873
Epoch 4/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 88ms/step - accuracy: 1.0000 - loss: 0.0035 - val_accuracy: 0.6446 - val_loss: 1.6338
Epoch 5/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 94ms/step - accuracy: 1.0000 - loss: 0.0016 - val_accuracy: 0.6386 - val_loss: 1.4921
Epoch 6/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 6.3239e-04 - val_accuracy: 0.6446 - val_loss: 1.6766
Epoch 7/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 4.7770e-04 - val_accuracy: 0.6446 - val_loss: 1.7681
Epoch 8/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 2.8391e-04 - val_accuracy: 0.6446 - val_loss: 1.8877
Epoch 9/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 2.8112e-04 - val_accuracy: 0.6386 - val_loss: 1.9450
Epoch 10/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.9816e-04 - val_accuracy: 0.6386 - val_loss: 2.0117
Epoch 11/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.6838e-04 - val_accuracy: 0.6386 - val_loss: 2.0466
2024-04-24 14:42:12.680198: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
Trial 1 Complete [00h 01m 41s]
val_loss: 0.6860103011131287
Best val_loss So Far: 0.6860103011131287
Total elapsed time: 00h 01m 41s
Search: Running Trial #2
Value |Best Value So Far |Hyperparameter
resnet |vanilla |image_block_1/block_type
True |True |image_block_1/normalize
True |False |image_block_1/augment
True |None |image_block_1/image_augmentation_1/horizontal_flip
True |None |image_block_1/image_augmentation_1/vertical_flip
0 |None |image_block_1/image_augmentation_1/contrast_factor
0 |None |image_block_1/image_augmentation_1/rotation_factor
0.1 |None |image_block_1/image_augmentation_1/translation_factor
0 |None |image_block_1/image_augmentation_1/zoom_factor
False |None |image_block_1/res_net_block_1/pretrained
resnet50 |None |image_block_1/res_net_block_1/version
True |None |image_block_1/res_net_block_1/imagenet_size
global_avg |flatten |classification_head_1/spatial_reduction_1/reduction_type
0 |0.5 |classification_head_1/dropout
adam |adam |optimizer
0.001 |0.001 |learning_rate
Epoch 1/1000
2024-04-24 14:42:37.904335: W tensorflow/core/kernels/gpu_utils.cc:68] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2024-04-24 14:42:41.385410: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
2024-04-24 14:42:41.385517: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2024-04-24 14:42:41.385535: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2024-04-24 14:42:41.385565: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
[[{{function_node __inference_one_step_on_data_59919}}{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]]
2024-04-24 14:42:41.385615: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
Traceback (most recent call last):
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial
self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial
results = self.run_trial(trial, *fit_args, **fit_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial
obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model
_, history = utils.fit_with_adaptive_batch_size(
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size
history = run_with_adaptive_batch_size(
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size
history = func(x=x, validation_data=validation_data, **fit_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
batch_size, lambda **kwargs: model.fit(**kwargs), **fit_kwargs
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference defined at (most recent call last):
File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 67, in <module>
File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 59, in main
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/tasks/image.py", line 168, in fit
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/auto_model.py", line 303, in fit
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 202, in search
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 234, in search
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 889, in _call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 696, in _initialize
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 339, in converted_call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3263, in call_for_each_replica
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 4061, in _call_for_each_replica
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 906, in _call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 132, in call_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 331, in converted_call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 104, in one_step_on_data
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 51, in train_step
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 224, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 289, in _moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/nn.py", line 1726, in moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 724, in moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 767, in _compute_moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1315, in moments_v2
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1267, in moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/gen_math_ops.py", line 12174, in squared_difference
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 670, in _create_op_internal
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2682, in _create_op_internal
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1177, in from_node_def
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1043, in _create_c_op
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/tf_stack.py", line 162, in extract_stack
'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
[[{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]] [Op:__inference_one_step_on_iterator_61440]
2024-04-24 14:42:43.435179: E external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:819] failed to record completion event; therefore, failed to create inter-stream dependency
2024-04-24 14:42:43.435283: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:2025] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7f10c2a8d600; host src: 0x7f1a35200000; size: 8=0x8
2024-04-24 14:42:43.435301: E external/local_xla/xla/stream_executor/stream.cc:331] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-04-24 14:42:43.435315: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:30] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-04-24 14:42:43.435327: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
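Since the output above is long, a small helper can pull out the parts that matter: which CUDA error codes appear and which graph node is named in the failure. This is only a sketch for reading a saved copy of the log; `summarize_log` is a hypothetical name and not part of AutoKeras or TensorFlow, and the two patterns are assumed from the message formats visible in this log.

```python
import re
from collections import Counter

# Matches codes such as CUDA_ERROR_ILLEGAL_ADDRESS or CUDA_ERROR_INVALID_HANDLE.
CUDA_ERROR = re.compile(r"CUDA_ERROR_[A-Z_]+")
# Matches the "{{node <name>}}" marker TensorFlow prints for the failing op.
NODE = re.compile(r"\{\{node ([^}]+)\}\}")


def summarize_log(text):
    """Return (Counter of CUDA error codes, sorted list of node names)."""
    codes = Counter(CUDA_ERROR.findall(text))
    nodes = sorted(set(NODE.findall(text)))
    return codes, nodes
```

Calling `summarize_log(open("train.log").read())` on a saved log (the file name is just an example) returns the error counts and the node names for a quick overview.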
Expected Behavior
I expect the search to continue through the remaining trials, since trial 1 completed successfully.
Setup Details
Include the details about the versions of:
OS type and version: Linux
Python: 3.10
autokeras: 2.0.0
keras-tuner: 1.4.7
scikit-learn:
numpy: 1.26.5
pandas:
tensorflow: 2.16.1
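The version list above can also be collected programmatically, which avoids blank or mistyped entries. A minimal sketch, assuming the PyPI distribution names below match the packages in the list; `installed_versions` is a hypothetical helper name, not part of any of these libraries.

```python
import platform
from importlib import metadata

# Distribution names as assumed to be published on PyPI.
PACKAGES = [
    "autokeras",
    "keras-tuner",
    "scikit-learn",
    "numpy",
    "pandas",
    "tensorflow",
]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"Python: {platform.python_version()}")
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the same virtual environment that runs the training script so the reported versions are the ones actually in use.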
Additional context
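A possible way to narrow this down, offered as a sketch and not a confirmed fix: CUDA kernel launches are asynchronous, so the node named in the traceback is not necessarily the one that actually faulted. Setting `CUDA_LAUNCH_BLOCKING=1` before TensorFlow is imported makes launches synchronous, so the error surfaces at the launch that caused it. `TF_FORCE_GPU_ALLOW_GROWTH=true` stops TensorFlow from reserving almost all GPU memory up front; whether memory pressure plays a role in this crash is an assumption. `apply_debug_env` is a hypothetical helper name.

```python
import os

# Debug settings; they only take effect if set before TensorFlow is imported.
DEBUG_ENV = {
    "CUDA_LAUNCH_BLOCKING": "1",          # synchronous kernel launches
    "TF_FORCE_GPU_ALLOW_GROWTH": "true",  # allocate GPU memory on demand
}


def apply_debug_env(env=os.environ):
    """Set the debug variables unless the caller already set them.

    Returns the effective values so they can be logged with the run.
    """
    for key, value in DEBUG_ENV.items():
        env.setdefault(key, value)
    return {key: env[key] for key in DEBUG_ENV}
```

Call `apply_debug_env()` at the very top of the script, above `import autokeras`. Synchronous launches slow training down, so this is meant for a diagnostic run only.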