[Jenkins] Segmentation Fault in test_conv_extra config #226

Open
daniellowell opened this issue May 19, 2020 · 15 comments


daniellowell commented May 19, 2020

bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0

and

bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 1, 1, 1, 1 --weights 1, 1, 3, 3 --pads_strides_dilations 1 1 2 2 2 1 --trans_output_pads 0 0

These configs fail on our Jenkins CI (gfx908).
They will be disabled temporarily until the issue is resolved.

Disabled in: #228

@atamazov

Analysis of a failing case

For now we save the binary in the same thread where it is being built. If we build several kernels, then saving may happen concurrently. Assembling is very fast, so concurrent saves are more likely to occur after assembly than after an OCL or HIP build.
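As a rough illustration of that pattern (a minimal sketch only, with illustrative names, not MIOpen's actual code), each builder thread assembles its kernel and then writes the result into a shared cache; without synchronization the concurrent save calls race on the shared container:

// Hypothetical sketch of the racy build-then-save pattern; names are illustrative.
#include <map>
#include <string>
#include <thread>
#include <vector>

std::map<std::string, std::string> binary_cache; // shared by every builder thread

std::string AssembleKernel(const std::string& src) { return "binary-for-" + src; }

void SaveBinary(const std::string& name, const std::string& bin)
{
    binary_cache[name] = bin; // unsynchronized write: two threads here at once is a data race
}

void BuildAndSave(const std::string& kernel)
{
    const std::string bin = AssembleKernel(kernel); // assembly is fast...
    SaveBinary(kernel, bin);                        // ...so several threads may reach this point together
}

int main()
{
    const char* kernels[] = {"conv_3x3_wheel_alpha_v9_0_15.s",
                             "Conv_Winograd_v16_5_0_stride1.s",
                             "Conv_Winograd_v20_5_23_M_stride1.s"};
    std::vector<std::thread> builders;
    for (const char* k : kernels)
        builders.emplace_back(BuildAndSave, std::string(k));
    for (std::thread& t : builders)
        t.join();
}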

In the attached logs, three asm kernels are built in parallel (download and open in a diff tool for details):

The failure happens just after the build, prior to any [PrepareInvoker] and [EvaluateInvokers] calls, most likely during [SaveBinary]. The relevant fragments are shown below. Good one:

MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-1979-aa97-3310-4d0d/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-a828-0239-1dce-3d85/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-75ab-f316-ef3b-746b/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 7e-05 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v16_5_0_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.000561 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v20_5_23_M_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.00012 ms
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConvRxSf3x2
MIOpen(HIP): Info2 [GetSolution]  N=8 C=128 H=28 W=28 K=128 n_groups=60 flags=7 R=1 S=1 pad_H=0 pad_W=0 out_H=28 out_W=28
MIOpen(HIP): Info2 [GetSolution] ...flags=519 d_N_stride=401408 d_C_stride=3136 f_K_stride=4 f_C_stride=512 o_N_stride=401408 o_K_stride=3136
MIOpen(HIP): Info [EvaluateInvokers] ConvBinWinogradRxSf3x2: miopenSp3AsmConvRxSf3x2: 0.092 < 3.40282e+38
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConv_group_20_5_23_M_stride1

Bad:

MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-58f3-cdab-b8aa-e17f/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-c90c-4e51-5e89-1c82/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-e7c4-5f56-e91f-1565/amdgcn-asm-out-XXXXXX'
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==25716==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f5ac7d40501 bp 0x7f5a939fd9b0 sp 0x7f5a939fd8e0 T25728)
==25716==The signal is caused by a READ memory access.
==25716==Hint: address points to the zero page.
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.001303 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v16_5_0_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.000237 ms
    #0 0x7f5ac7d40501  (/opt/rocm/bin/../lib/libhip_hcc.so+0x19501)
    #1 0x7f5ac7d3a29b  (/opt/rocm/bin/../lib/libhip_hcc.so+0x1329b)
    #2 0x7f5ac7dc146d  (/opt/rocm/bin/../lib/libhip_hcc.so+0x9a46d)
    #3 0x7f5ac7dc2918  (/opt/rocm/bin/../lib/libhip_hcc.so+0x9b918)
    #4 0x7f5acc3aefc9  (/var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/build/lib/libMIOpen.so.1+0x3ec5fc9)
    #5 0x7f5acc3c112d  (/var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/build/lib/libMIOpen.so.1+0x3ed812d)
...	

Note the suspicious extra "MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file" and "MIOpen(HIP): Info [KernDb] database not present" lines in the failing log (however, these may relate to something else).

I tend to think that the reason is that the SQLite binary cache is not fully MT-safe yet. The issues disappear after switching to a file-based binary cache.

Perhaps it is enough to stop removing the binary from the cache when the cache is disabled here:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/3ed5e8569ec82ea2d5f8706d279e2074ceeb8a5a/src/binary_cache.cpp#L163-L164

However, even if this resolves the problem, it seems worth clearly identifying the root cause of the issue first.
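If the SQLite cache does turn out to be the culprit, one conventional remedy would be to serialize access to it and to initialize the database exactly once. The sketch below is only illustrative (the class and method names are assumptions, not the actual MIOpen API), assuming the cache object is shared between build threads:

// Illustrative thread-safe wrapper pattern, not the MIOpen binary cache itself.
#include <mutex>
#include <string>

class KernelBinaryCache
{
    std::once_flag init_flag_;
    std::mutex mtx_;
    bool enabled_ = true;

    void InitDatabase()
    {
        // Open/create the SQLite file here. std::call_once guarantees this runs exactly
        // once even if several build threads arrive at the same time (compare the doubled
        // "Initializing system database file" lines in the failing log above).
    }

public:
    void Save(const std::string& name, const std::string& binary)
    {
        if(!enabled_)
            return; // when the cache is disabled, skip both the insert and the removal
        std::call_once(init_flag_, &KernelBinaryCache::InitDatabase, this);
        std::lock_guard<std::mutex> lock(mtx_); // serialize concurrent SaveBinary calls
        // ... write `binary` for `name` into the database ...
        (void)name;
        (void)binary;
    }
};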

/cc @JehandadKhan

atamazov mentioned this issue May 25, 2020
@JehandadKhan

@atamazov There is no harm in removing that line; however, please note that since the case at hand is an empty database, it would return from the call immediately here.


aserio commented May 29, 2020

This issue can be closed once PR #240 is merged.

@atamazov

This is no longer the case; I just restored #226 in #240 because of this failure during "Full long tests / FP32 gfx908 Hip Release All subset":

../bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0 
MIOpen(HIP): Warning [Prefetch] File is unreadable: /var/jenkins/workspace/Libs_MIOpen_fix-jenkins-failures/src/kernels/gfx90878.HIP.fdb.txt
error: 4.71152e-08
Max diff: 0.00146484
Mismatch at 0: 1309 != 1309
Forward convolution: fft
Input tensor: 4, 64, 14, 14
Weights tensor: 24, 64, 5, 5
Output tensor: 
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {2, 2}, {1, 1}, {1, 1}, 
../bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0 
error: 0
Max diff: 0
Backward convolution: ConvBinWinogradRxS
Input tensor: 4, 64, 14, 14
Weights tensor: 24, 64, 5, 5
Output tensor: 4, 24, 14, 14
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {2, 2}, {1, 1}, {1, 1}, 
../bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0 
Memory access fault by GPU node-4 (Agent handle: 0x1dbb0f0) on address 0x7f9f09444000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
test/CMakeFiles/test_conv_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_conv_extra' failed

This needs to be investigated.


aserio commented Jun 4, 2020

@atamazov, do you have a status update on this issue?

atamazov removed their assignment Jun 5, 2020

atamazov commented Jun 5, 2020

No.


aserio commented Jun 8, 2020

@daniellowell, please assign someone to this blocking issue.


aserio commented Jun 19, 2020

@daniellowell, do you have an update on this issue?


aserio commented Jun 30, 2020

Only reproducible in Jenkins, as far as I know.


atamazov commented Jul 1, 2020

@aserio Why was the priority changed from blocker to unknown?


aserio commented Jul 1, 2020

@atamazov, in both cases @daniellowell suggested the de-escalation. In the last change (from priority_high to priority_unknown) Daniel noted that we are unable to reproduce the issue outside of Jenkins. Feel free to re-prioritize the ticket if you have some insight here!


atamazov commented Jul 1, 2020

I see. I conclude that @daniellowell thinks the reason for our inability to reproduce these errors is an "unusual" (or somewhat incorrect) Jenkins environment (i.e. not vanilla ROCm). This sounds reasonable. Let's assign low_priority. Eventually, after some Jenkins upgrade, we shall re-enable these tests and see what happens.

atamazov changed the title from "Segmentation Fault in test_conv_extra config" to "[Jenkins] Segmentation Fault in test_conv_extra config" Jul 1, 2020

atamazov commented Jul 1, 2020

If we take the above as the current hypothesis, then this is not a bug.

atamazov added the quality label and removed the bug label Jul 1, 2020
@ppanchad-amd

Hi, is this still reproducible on Jenkins? If not, can we close this bug?

@atamazov

We need to assign someone and investigate. Do not close.
