Classification example - segmentation fault on some systems #136

krichardsson · 2023-11-08T08:49:10Z

There is a discussion indicating that there are issues running the classification example.

I did a quick test and found some (other) problems:

1. The `requirements.txt` file contains fixed versions which is bad from a maintainability point of view and I got a bunch of conflicts on my machine.
2. When running `python train_classifier.py` I get a segmentation fault(!), not sure why.

My conclusion is that we should take a look at this example and make sure it works.


gemenerik · 2024-02-19T12:40:35Z

> There is a discussion indicating that there are issues running the classification example.

Answered ✅

> 1. The requirements.txt file contains fixed versions which is bad from a maintainability point of view and I got a bunch of conflicts on my machine.

Conflicts can be avoided by installing the requirements into a separate Python environment. From experience, I know it can be really troublesome to work with deep learning repos that have loose requirements. Since this is an application rather than a library, I think fixed versions should be acceptable. But I'm curious to hear arguments for loosening them.

> 2. When running `python train_classifier.py` I get a segmentation fault(!), not sure why.

With a Python=3.10 conda env + pip installing the requirements.txt (as instructed in the classification demo docs) training works for me out of the box.
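
To make the pinned-versus-loose discussion in point 1 concrete, here is a minimal sketch that reports which lines of a requirements file are pinned with `==`. The `split_pinned` helper and the sample lines are hypothetical and are not the contents of the repository's `requirements.txt`.

```python
import re

# Matches "name==version" at the start of a requirement line (an exact pin).
PINNED = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def split_pinned(lines):
    """Return (pinned, loose): pinned maps name -> version, loose lists all other requirement lines."""
    pinned, loose = {}, []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        match = PINNED.match(line)
        if match:
            pinned[match.group(1)] = match.group(2)
        else:
            loose.append(line)
    return pinned, loose


# Hypothetical sample, for illustration only.
sample = [
    "tensorflow==2.11.0  # exact pin",
    "numpy>=1.19.5,<1.27.0",
    "pillow",
    "",
]

pinned, loose = split_pinned(sample)
print("pinned:", pinned)
print("loose:", loose)
```

Exact pins make an install reproducible inside a fresh environment, while ranges and bare names leave the resolver free to pick newer releases, which is where unexpected conflicts tend to come from.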

luigifeola · 2024-06-26T14:49:54Z

> 2. When running `python train_classifier.py` I get a segmentation fault(!), not sure why.
>
> With a Python=3.10 conda env + pip installing the requirements.txt (as instructed in the classification demo docs) training works for me out of the box.

Hi @gemenerik, I still get a segmentation fault, even after creating a conda environment from scratch. Is there anything else I can do to run the code?

gemenerik · 2024-06-28T08:27:03Z

Can you share some more details? Like what OS you are using? A terminal printout? Anything that helps me reproduce the problem.

luigifeola · 2024-06-28T13:11:03Z

Sure, here it is.
OS: Pop!_OS 22.04
I created a conda environment and installed all the packages listed in `aideck-gap8-examples/examples/ai/classification/requirements.txt` with the usual `pip install -r requirements.txt`.

$ conda list -n ai-classification python
# packages in environment at /home/gigi-labs/.miniconda/envs/ai-classification:
#
# Name                    Version                   Build  Channel
python                    3.10.14              h955ad1f_1

This is the terminal output when I try to run the train_classifier.py script:

(ai-classification) gigi-labs@pop-os:/media/gigi-labs/T7/repos/bitcraze/aideck-gap8-examples/examples/ai/classification$ python train_classifier.py

2024-06-28 15:06:02.204756: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-28 15:06:02.298341: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-28 15:06:02.659914: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
2024-06-28 15:06:02.659964: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
2024-06-28 15:06:02.659968: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
training_data/*/*/*
Found 1375 images belonging to 2 classes.
Found 450 images belonging to 2 classes.
2024-06-28 15:06:03.066179: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-06-28 15:06:03.089418: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
2024-06-28 15:06:03.089438: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2024-06-28 15:06:03.089635: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 separable_conv2d (Separable  (None, 162, 122, 3)      7         
 Conv2D)                                                         
                                                                 
 resizing (Resizing)         (None, 96, 96, 3)         0         
                                                                 
 mobilenetv2_0.35_96 (Functi  (None, 3, 3, 1280)       410208    
 onal)                                                           
                                                                 
 separable_conv2d_1 (Separab  (None, 1, 1, 32)         52512     
 leConv2D)                                                       
                                                                 
 dropout (Dropout)           (None, 1, 1, 32)          0         
                                                                 
 global_average_pooling2d (G  (None, 32)               0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 66        
                                                                 
=================================================================
Total params: 462,793
Trainable params: 52,585
Non-trainable params: 410,208
_________________________________________________________________
Number of trainable weights = 8
Epoch 1/20
Segmentation fault (core dumped)

Is it a problem if I store and run everything from an external SSD?
Any help is really appreciated.

gemenerik · 2024-07-02T09:03:02Z

Oof, that is not a very informative error. Can you run any of the official tensorflow examples for this install?
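
One way to get more out of a bare `Segmentation fault` is Python's standard-library `faulthandler` module, which dumps the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. A minimal sketch, assuming it is placed at the very top of `train_classifier.py`, before TensorFlow is imported (that placement is a suggestion, not something taken from the repository):

```python
import faulthandler

# Dump the Python traceback of every thread to stderr if the process
# receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL).
faulthandler.enable(all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be switched on without editing the script by running `python -X faulthandler train_classifier.py`. The dump only contains Python frames, so it can show which Python call was active when the crash happened, not why the native code underneath it crashed.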

luigifeola · 2024-07-02T11:00:20Z

Some more info, nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I tried this quickstart example, and the model trained correctly (exactly as shown there).

luigifeola · 2024-07-03T06:31:03Z

Good news: the `train_classifier.py` script works in a Docker container that comes pre-built with TensorFlow (without installing the packages in `requirements.txt`). The image is `nvcr.io/nvidia/tensorflow:23.03-tf2-py3`, which runs Python 3.8.10.

gemenerik · 2024-07-03T08:36:39Z

Good idea to try a docker container. Instead of an nvidia one, I will try to find a tensorflow/tensorflow container that works for this example.

EDIT: that will likely be tensorflow/tensorflow:2.11.0

gemenerik · 2024-07-03T09:11:29Z

If you have a chance to test it: create a file `train_classifier.sh` in the `examples/ai/classification` folder with the following contents:

#!/usr/bin/env bash
set -e

full_path=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )

cd "${full_path}"

pip install pillow scipy
python train_classifier.py

Then, from the repository root folder, run:

docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow:2.11.0 examples/ai/classification/train_classifier.sh

luigifeola · 2024-07-03T13:32:43Z

Thanks for your support.

However, it does not work. This is the output I got:

~/aideck-gap8-examples$ docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow:2.11.0 examples/ai/classification/train_classifier.sh

Collecting pillow
  Downloading pillow-10.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
     |████████████████████████████████| 4.4 MB 3.2 MB/s 
Collecting scipy
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
     |████████████████████████████████| 34.5 MB 7.5 MB/s 
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /usr/local/lib/python3.8/dist-packages (from scipy) (1.23.4)
Installing collected packages: pillow, scipy
Successfully installed pillow-10.4.0 scipy-1.10.1
WARNING: You are using pip version 20.2.4; however, version 24.1.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
2024-07-03 13:30:47.917060: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-03 13:30:47.978827: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
training_data/*/*/*
Found 1375 images belonging to 2 classes.
Found 450 images belonging to 2 classes.
2024-07-03 13:30:48.742408: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_0.35_96_no_top.h5
2019640/2019640 [==============================] - 0s 0us/step
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 separable_conv2d (Separable  (None, 162, 122, 3)      7         
 Conv2D)                                                         
                                                                 
 resizing (Resizing)         (None, 96, 96, 3)         0         
                                                                 
 mobilenetv2_0.35_96 (Functi  (None, 3, 3, 1280)       410208    
 onal)                                                           
                                                                 
 separable_conv2d_1 (Separab  (None, 1, 1, 32)         52512     
 leConv2D)                                                       
                                                                 
 dropout (Dropout)           (None, 1, 1, 32)          0         
                                                                 
 global_average_pooling2d (G  (None, 32)               0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 66        
                                                                 
=================================================================
Total params: 462,793
Trainable params: 52,585
Non-trainable params: 410,208
_________________________________________________________________
Number of trainable weights = 8
Epoch 1/20
examples/ai/classification/train_classifier.sh: line 9:    13 Segmentation fault      (core dumped) python train_classifier.py

gemenerik · 2024-07-05T09:31:25Z

Curious. Do you have an NVIDIA GPU?

luigifeola · 2024-07-08T07:41:05Z

Sorry for the late reply.

Yes I have an NVIDIA GPU, this is my nvidia-smi output:

Mon Jul  8 09:39:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A3000 12GB La...    On  | 00000000:01:00.0  On |                  Off |
| N/A   44C    P0              21W /  80W |    914MiB / 12288MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

gemenerik · 2024-07-09T08:11:53Z

Thanks! I think for now we'll leave this issue open and consider the NVIDIA docker a workaround for NVIDIA GPU users that run into the segmentation fault.

gemenerik · 2024-07-10T12:19:22Z

@luigifeola the above might work with the tensorflow/tensorflow:2.11.0-gpu docker image

luigifeola · 2024-07-29T17:30:07Z

Hi @gemenerik, sorry for the super late reply. Even with `docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow:2.11.0-gpu examples/ai/classification/train_classifier.sh` I still get the segmentation fault:

Collecting pillow
  Downloading pillow-10.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
     |████████████████████████████████| 4.4 MB 4.0 MB/s 
Collecting scipy
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
     |████████████████████████████████| 34.5 MB 154.1 MB/s 
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /usr/local/lib/python3.8/dist-packages (from scipy) (1.23.4)
Installing collected packages: pillow, scipy
Successfully installed pillow-10.4.0 scipy-1.10.1
WARNING: You are using pip version 20.2.4; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
2024-07-29 17:27:50.002395: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-29 17:27:50.082995: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
training_data/*/*/*
Found 7785 images belonging to 2 classes.
Found 2601 images belonging to 2 classes.
2024-07-29 17:27:51.023124: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: UNKNOWN ERROR (34)
2024-07-29 17:27:51.023153: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2024-07-29 17:27:51.023290: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_0.35_96_no_top.h5
2019640/2019640 [==============================] - 0s 0us/step
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 separable_conv2d (Separable  (None, 162, 122, 3)      7         
 Conv2D)                                                         
                                                                 
 resizing (Resizing)         (None, 96, 96, 3)         0         
                                                                 
 mobilenetv2_0.35_96 (Functi  (None, 3, 3, 1280)       410208    
 onal)                                                           
                                                                 
 separable_conv2d_1 (Separab  (None, 1, 1, 32)         52512     
 leConv2D)                                                       
                                                                 
 dropout (Dropout)           (None, 1, 1, 32)          0         
                                                                 
 global_average_pooling2d (G  (None, 32)               0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 66        
                                                                 
=================================================================
Total params: 462,793
Trainable params: 52,585
Non-trainable params: 410,208
_________________________________________________________________
Number of trainable weights = 8
Epoch 1/20
examples/ai/classification/train_classifier.sh: line 9:    13 Segmentation fault      (core dumped) python train_classifier.py
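
The `cuda_diagnostics` line in the log above says that `/dev/nvidia0` does not exist inside the container, so this run did not see the GPU at all. A quick way to check for that before starting a training run is sketched below. The `nvidia_device_nodes` helper is a hypothetical name, and the check assumes a Linux system where the NVIDIA driver exposes GPUs as `/dev/nvidia0`, `/dev/nvidia1`, and so on.

```python
import glob


def nvidia_device_nodes(pattern="/dev/nvidia[0-9]*"):
    """Return the sorted list of NVIDIA GPU device nodes visible to this process."""
    return sorted(glob.glob(pattern))


nodes = nvidia_device_nodes()
if nodes:
    print("GPU device nodes visible:", ", ".join(nodes))
else:
    # Same condition TensorFlow reported in the log: /dev/nvidia0 is missing.
    print("No /dev/nvidia<N> device nodes visible; expect TensorFlow to run on CPU only.")
```

Where TensorFlow is installed, `tf.config.list_physical_devices("GPU")` gives the framework's own view of the available GPUs.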

knmcguire · 2024-07-31T07:37:33Z

Hi! Rik will be back next week so I'll notify him once he is back

gemenerik · 2024-08-05T12:05:13Z

It may be related to how TensorFlow is built, possibly involving the GPU. Works fine on a GTX 1080 system. Haven't been able to reproduce the problem and a workaround was found, so not digging deeper for now.

luigifeola · 2024-09-04T15:12:01Z

Hi @gemenerik,
Training now works for me with the `tensorflow/tensorflow:2.11.0-gpu` Docker image, as you suggested. However, some additional arguments have to be passed to `docker run`, such as: `--gpus all --ipc=host --shm-size=4g --ulimit memlock=-1`.
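
For reference, the sketch below combines these flags with the `docker run` command posted earlier in the thread. The exact command line that was used is not given here, so this particular combination is an assumption. `--gpus all` also requires the NVIDIA Container Toolkit on the host.

```python
import os
import shlex

# Extra flags reported as necessary in this comment.
extra_flags = ["--gpus", "all", "--ipc=host", "--shm-size=4g", "--ulimit", "memlock=-1"]

command = [
    "docker", "run", "-it", "--rm",
    *extra_flags,
    "-v", f"{os.getcwd()}:/tmp",  # mount the current directory (repository root), as in the earlier command
    "-w", "/tmp",
    "tensorflow/tensorflow:2.11.0-gpu",
    "examples/ai/classification/train_classifier.sh",
]

# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(command))
```

Run it from the repository root so that the mounted directory contains `examples/ai/classification/train_classifier.sh`.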

Additionally, the tensorflow Docker container recommends running in non-root mode. To follow this best practice, I created a custom image based on tensorflow/tensorflow:2.11.0-gpu, where I added a non-root user called user. I'm happy to share the custom image if needed.

The Lite model works well on my custom dataset, but when deployed it classifies the input as background ~90% of the time. This seems to be a separate issue, so I will open a new issue to address it: #145

Thanks again for your support!

gemenerik · 2024-09-16T10:04:26Z

Related to this, the documentation has been updated to include instructions for Docker-based training.

knmcguire added the bug (Something isn't working) and triage needed labels Feb 15, 2024

gemenerik closed this as completed Feb 20, 2024

knmcguire removed the triage needed label Mar 19, 2024

knmcguire reopened this Jul 1, 2024

gemenerik changed the title ~~Classification example is not working~~ Classification example - segmentation fault on some systems Jul 9, 2024
