[Jenkins] Segmentation Fault in test_conv_extra config #226
Analysis of a failing case

For now we save the binary in the same thread where it is built. If we build several kernels, saving may happen concurrently. Assembling is very fast, so concurrent saves are more likely after assembly than after an OCL or HIP build.

In the attached logs, three asm kernels are built in parallel (download and open in a diff tool for details):

The failure happens just after the build, prior to any [PrepareInvoker] and [EvaluateInvokers] calls, most likely during [SaveBinary]. Suspicious fragments are shown below.

Good one:
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-1979-aa97-3310-4d0d/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-a828-0239-1dce-3d85/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-75ab-f316-ef3b-746b/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 7e-05 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v16_5_0_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.000561 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v20_5_23_M_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.00012 ms
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConvRxSf3x2
MIOpen(HIP): Info2 [GetSolution] N=8 C=128 H=28 W=28 K=128 n_groups=60 flags=7 R=1 S=1 pad_H=0 pad_W=0 out_H=28 out_W=28
MIOpen(HIP): Info2 [GetSolution] ...flags=519 d_N_stride=401408 d_C_stride=3136 f_K_stride=4 f_C_stride=512 o_N_stride=401408 o_K_stride=3136
MIOpen(HIP): Info [EvaluateInvokers] ConvBinWinogradRxSf3x2: miopenSp3AsmConvRxSf3x2: 0.092 < 3.40282e+38
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConv_group_20_5_23_M_stride1

Bad:
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-58f3-cdab-b8aa-e17f/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-c90c-4e51-5e89-1c82/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-e7c4-5f56-e91f-1565/amdgcn-asm-out-XXXXXX'
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==25716==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f5ac7d40501 bp 0x7f5a939fd9b0 sp 0x7f5a939fd8e0 T25728)
==25716==The signal is caused by a READ memory access.
==25716==Hint: address points to the zero page.
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.001303 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v16_5_0_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.000237 ms
#0 0x7f5ac7d40501 (/opt/rocm/bin/../lib/libhip_hcc.so+0x19501)
#1 0x7f5ac7d3a29b (/opt/rocm/bin/../lib/libhip_hcc.so+0x1329b)
#2 0x7f5ac7dc146d (/opt/rocm/bin/../lib/libhip_hcc.so+0x9a46d)
#3 0x7f5ac7dc2918 (/opt/rocm/bin/../lib/libhip_hcc.so+0x9b918)
#4 0x7f5acc3aefc9 (/var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/build/lib/libMIOpen.so.1+0x3ec5fc9)
#5 0x7f5acc3c112d (/var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/build/lib/libMIOpen.so.1+0x3ed812d)
...

Note suspicious extra

I tend to think that the reason is that the SQLite binary cache is not fully MT-safe yet. The issues disappear after switching to a file-based binary cache. Perhaps it is enough to stop removing the binary from the cache when the cache is disabled here:

However, even if this resolves the problem, it seems worthwhile to clearly identify the root cause of the issue first. /cc @JehandadKhan
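The hypothesis above (concurrent saves into a cache that is not fully MT-safe) can be illustrated with a minimal C++ sketch. This is not MIOpen code: `BinaryCache`, `BuildAndSave`, and `BuildAllInParallel` are hypothetical names, and a `std::map` stands in for the SQLite-backed storage. It only shows the general idea of serializing saves with a mutex when each build thread saves its own result.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a binary cache whose storage is not safe
// to modify from several threads at once (assumption, not MIOpen's class).
class BinaryCache
{
public:
    void Save(const std::string& name, const std::string& binary)
    {
        // Serialize writers: only one thread touches the storage at a time.
        std::lock_guard<std::mutex> lock(mutex_);
        records_.erase(name); // remove-then-insert, similar to the order seen in the logs
        records_.emplace(name, binary);
    }

    std::size_t Size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return records_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, std::string> records_;
};

// Hypothetical build step: "assembles" a kernel and saves it from the same thread.
void BuildAndSave(BinaryCache& cache, const std::string& kernel)
{
    const std::string binary = "binary-of-" + kernel; // placeholder for assembler output
    cache.Save(kernel, binary);
}

// Builds every kernel on its own thread and returns the number of cached binaries.
std::size_t BuildAllInParallel(const std::vector<std::string>& kernels)
{
    BinaryCache cache;
    std::vector<std::thread> workers;
    for(const auto& kernel : kernels)
        workers.emplace_back(BuildAndSave, std::ref(cache), kernel);
    for(auto& worker : workers)
        worker.join();
    return cache.Size();
}
```

Calling `BuildAllInParallel` with the three kernel names from the logs leaves three records in the cache. Whether a lock like this would address the actual crash is unknown: the top frames of the stack trace above are inside libhip_hcc.so, so the root cause may lie elsewhere.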
This issue can be closed once PR #240 is merged.
This is no longer the case: I just restored #226 in #240 because of this failure during "Full long tests / FP32 gfx908 Hip Release All subset":
Needs investigation.
@atamazov, do you have a status update on this issue?
No.
@daniellowell, please assign someone to this blocking issue.
@daniellowell, do you have an update on this issue?
Only reproducible in Jenkins afaik
@aserio Why was the priority changed from blocker to unknown?
@atamazov, in both cases @daniellowell suggested the de-escalation. In the last change (from priority_high to priority_unknown) Daniel noted that we are unable to reproduce the issue outside of Jenkins. Feel free to re-prioritize the ticket if you have some insight here!
I see. I conclude that @daniellowell thinks the reason for our inability to reproduce these errors is an "unusual" (or somewhat incorrect) Jenkins environment (one that is not vanilla ROCm). This sounds reasonable. Let's assign low_priority. Eventually, after some Jenkins upgrade, we shall re-enable these tests and see what happens.
If we take the above as the current hypothesis, then this is not a bug.
Hi, |
We need to assign someone and investigate. Do not close. |
bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0
and
bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 1, 1, 1, 1 --weights 1, 1, 3, 3 --pads_strides_dilations 1 1 2 2 2 1 --trans_output_pads 0 0
Failing configs on our Jenkins CI, gfx908.
Will temporarily disable until resolved.
Disabled in: #228