Segmentation fault while training #70
Comments
Same problem.
I fixed this problem by porting the original CUDA code from C++ to Python (numba.cuda). Maybe that avoids some compilation bugs. (I tried to use pybind11, but it failed as well.)
@WEIIEW97 Actually, I was using this code because another work depends on it, and I use it to re-implement that work, so I didn't dive into the Python code of this project. I changed some code in the C++ CUDA extension, but I'm not sure whether it helps. I'd like to know whether you fixed this bug in your Python (numba.cuda) code, and it would be great if you could share your numba.cuda file. Looking forward to your reply. Thanks.
@ironheads Thanks for your response and for sharing! I didn't dig as deep as you did, so I really appreciate the information.
@WEIIEW97 Here is the relevant setup and the change:

```python
# LUT0 = Generator3DLUT_identity()
# LUT1 = Generator3DLUT_zero()
# LUT2 = Generator3DLUT_zero()
# ...
# img: a batch of images from the dataset with shape [batch_size, 3, width, height].
# Make sure the values are in the range [0, 1] (my segmentation fault came from this).

# The following code comes from image_adaptive_lut_train_paired.py:
pred = classifier(img).squeeze()  # img is still [batch_size, 3, width, height]

# Then you should modify the code as follows:
new_img = img.permute(1, 0, 2, 3).contiguous()
gen_A0 = LUT0(new_img)
gen_A1 = LUT1(new_img)
gen_A2 = LUT2(new_img)
combine_A = new_img.new(new_img.size())
for b in range(new_img.size(1)):
    combine_A[:, b, :, :] = (pred[b, 0] * gen_A0[:, b, :, :]
                             + pred[b, 1] * gen_A1[:, b, :, :]
                             + pred[b, 2] * gen_A2[:, b, :, :])
                             # + pred[b, 3] * gen_A3[:, b, :, :] + pred[b, 4] * gen_A4[:, b, :, :]
result_A = combine_A.permute(1, 0, 2, 3)  # back to the [batch_size, 3, width, height] combined image
```

The key is to make the LUT's input have shape [3, batch_size, width, height]. Some other code that uses the LUT may need similar changes; I don't list them all because I don't use all of the Python code in this work.
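The axis swap described above can be illustrated on its own. The sketch below is a NumPy analogue of the torch `permute(1, 0, 2, 3).contiguous()` call (the shapes and variable names are illustrative, not from the repository); it shows that moving the channel axis to the front and back again is lossless, so only the memory layout seen by the kernel changes.

```python
import numpy as np

# Hypothetical mini-batch: [batch_size, 3, width, height]
batch_size, width, height = 4, 8, 8
img = np.arange(batch_size * 3 * width * height, dtype=float).reshape(
    batch_size, 3, width, height)

# Move the channel axis first so the kernel sees [3, batch_size, width, height]
# (the torch equivalent is img.permute(1, 0, 2, 3).contiguous())
new_img = np.ascontiguousarray(np.transpose(img, (1, 0, 2, 3)))
assert new_img.shape == (3, batch_size, width, height)

# After the per-image LUT blend, permute back to [batch_size, 3, width, height]
result = np.transpose(new_img, (1, 0, 2, 3))
assert result.shape == (batch_size, 3, width, height)
assert np.array_equal(result, img)  # a pure axis swap loses no data
```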
Another important modification is `TrilinearInterpolationFunction`; it should be changed as follows:

```python
class TrilinearInterpolationFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, lut, x):
        x = x.contiguous()
        output = x.new(x.size())
        dim = lut.size()[-1]
        shift = dim ** 3
        binsize = 1.000001 / (dim - 1)
        W = x.size(2)
        H = x.size(3)
        batch = x.size(1)  # this changes: x is now [3, batch_size, W, H]
        assert 1 == trilinear.forward(lut, x, output,
                                      dim, shift, binsize, W, H, batch)
        int_package = torch.IntTensor([dim, shift, W, H, batch])
        float_package = torch.FloatTensor([binsize])
        variables = [lut, x, int_package, float_package]
        ctx.save_for_backward(*variables)
        return lut, output

    @staticmethod
    def backward(ctx, lut_grad, x_grad):
        lut, x, int_package, float_package = ctx.saved_variables
        dim, shift, W, H, batch = int_package
        dim, shift, W, H, batch = int(dim), int(shift), int(W), int(H), int(batch)
        binsize = float(float_package[0])
        assert 1 == trilinear.backward(x, x_grad, lut_grad,
                                       dim, shift, binsize, W, H, batch)
        return lut_grad, x_grad
```

All of these modifications aim to make the LUT's input have shape [3, batch_size, width, height] and then reshape the LUT's output back to [batch_size, 3, width, height]. Another important point is that the input values should be in the range [0, 1]. I don't know whether this will help if your code fails when batch_size > 1.
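To see why values outside [0, 1] can crash the raw C++/CUDA kernel, consider how the interpolation locates a LUT cell. The sketch below is an assumption about the indexing scheme, inferred from the `binsize = 1.000001 / (dim - 1)` formula above, and is not code from the repository: an input of exactly 1.0 still lands in the last valid cell, but anything above 1 or below 0 yields an out-of-range index, and in unchecked C/CUDA that becomes an out-of-bounds read, i.e. the segmentation fault.

```python
import math

def lut_cell_index(v, dim):
    """Index of the LUT cell a value v falls into, mirroring the kernel's
    presumed id = floor(v / binsize) with binsize = 1.000001 / (dim - 1)."""
    binsize = 1.000001 / (dim - 1)
    return math.floor(v / binsize)

dim = 33  # a common 33x33x33 LUT size (assumption, not from the repo)
assert lut_cell_index(0.0, dim) == 0         # first cell
assert lut_cell_index(1.0, dim) == dim - 2   # last valid cell (index 31)
assert lut_cell_index(1.5, dim) > dim - 2    # past the grid: OOB read in C/CUDA
assert lut_cell_index(-0.1, dim) < 0         # negative index: OOB as well
```

The tiny 1.000001 fudge factor keeps v = 1.0 strictly inside the last cell instead of landing exactly on its upper edge, which is why clamping inputs to [0, 1] (or normalizing the dataset) is enough to avoid the crash.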
@ironheads
It occurred as 'Segmentation fault (core dumped)' for the CPU version and 'cudaCheckError() failed : invalid device function. Segmentation fault (core dumped)' for the CUDA version every time I trained this network. How can it be solved? Thanks in advance.