segF when evaluating pytorch file using gpgpu-sim #256

Open
Wen-Tian-Pineapple opened this issue Nov 7, 2023 · 12 comments

Comments

@Wen-Tian-Pineapple

image
image
When evaluating a Python program I ran into the error above; the last line of the GPGPU-Sim output is just a thread block.
I used PyInstaller to package the Python file and ran the resulting executable from the "dist" directory. Could that be causing the problem?
I'm also using CUDA 11.01 and gcc 7.3.1 on CentOS 7.
Does anybody have any idea what the issue might be?

@JRPan
Collaborator

JRPan commented Nov 8, 2023

That is a segfault. You can run this directly or in gdb to see which line caused the issue.

Thanks

@Wen-Tian-Pineapple
Author

Thanks. I'm currently trying to debug it with gdb.
I also wanted to mention that I did the same thing as #238: I tried running everything in the Docker image provided in this repo and hit a similar segmentation fault when processing kernel-17.traceg.
image
image

@JRPan
Collaborator

JRPan commented Nov 8, 2023

Yeah, that's fine. If you can tell me the exact line where the segfault occurs, I can provide some hints and help you narrow down the issue.

@Wen-Tian-Pineapple
Author

Thanks for the reply. The problem is shown below.
image

@Wen-Tian-Pineapple
Author

The segfault happens with the DEPBAR instruction, specifically at "Check for the case that the LDGSTSs monitored have finished when encountering the DEPBAR instruction".
Maybe the newly added LDGSTS support is buggy?

@JRPan
Collaborator

JRPan commented Nov 8, 2023

@Connie120

@tyhiwzm

tyhiwzm commented Nov 8, 2023

While using the parboil-sad workload, I encountered the same error in the dev version.

@JRPan
Collaborator

JRPan commented Nov 8, 2023

Thank you for the info.

Unfortunately we have a major conference deadline approaching and we won't be able to work on that soon.

You may look into it if you want; we are happy to accept any fix.

I suggest you check out a commit right before the LDGSTS merge and use that for now.

Thanks!

@JRPan
Collaborator

JRPan commented Nov 8, 2023

@Wen-Tian-Pineapple What workload is this?

You are using the TITANV config, which does not have the DEPBAR feature. So possibly we did not disable this correctly on old configs that lack the feature.
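To illustrate what "disabling the feature on old configs" could look like, here is a minimal sketch. It is not taken from the simulator source: `GpuConfig`, `supports_ldgsts_depbar`, `Opcode`, and `should_track_depbar` are all hypothetical names, and the idea of a per-config boolean flag is an assumption.

```cpp
#include <string>

// Hypothetical config record; the field names are illustrative only and do
// not correspond to real simulator options.
struct GpuConfig {
  std::string name;
  bool supports_ldgsts_depbar;  // assumed per-config feature flag
};

// Hypothetical opcode tag for a traced instruction.
enum class Opcode { LDGSTS, DEPBAR, OTHER };

// Decide whether the DEPBAR bookkeeping path should run at all.
bool should_track_depbar(const GpuConfig& cfg, Opcode op) {
  if (!cfg.supports_ldgsts_depbar) {
    return false;  // Older configs: skip the DEPBAR-specific handling.
  }
  return op == Opcode::DEPBAR;
}
```

With a gate like this, a DEPBAR in a trace replayed under a config without the feature would never reach the LDGSTS bookkeeping.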

Please use https://github.com/accel-sim/gpgpu-sim_distribution/tree/53e99da4d21eacbf103ba55bcc9cb6e05219cb91 and https://github.com/accel-sim/accel-sim-framework/tree/241762826c193e6589ea9959bd074d94c826bc15 instead

@tyhiwzm Which config are you using?

Thanks!

@Wen-Tian-Pineapple
Author

Thanks for the reference; it's working fine now. BTW, I'm using the titanX/titanV configs.

@tyhiwzm

tyhiwzm commented Nov 9, 2023

Thank you for your reply.
I'm using the A100 config, just like #138. The machine I am actually using is also an A100.
I figured out what's causing the problem. The DEPBAR instruction shows up in the trace I captured, but there's no LDGSTS instruction. As a result, m_warp[warp_id]->m_ldgdepbar_buf[i].size() in shader_core_ctx::issue_warp fails, since m_ldgdepbar_buf is empty.
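The failure mode described here, indexing into an empty vector of vectors, can be sketched in isolation. The code below is a simplified illustration and not a patch for the simulator: `WarpState`, `ldgdepbar_buf`, `pending_in_group`, and `depbar_can_issue` are hypothetical stand-ins, and treating a missing group as "nothing to wait for" is an assumption about the desired behavior.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-warp state; not the simulator's real class.
struct WarpState {
  // One inner vector per recorded group, each holding pending LDGSTS ids.
  std::vector<std::vector<int>> ldgdepbar_buf;
};

// Bounds-checked replacement for `buf[i].size()`: a group that was never
// recorded (including every index when the buffer is empty) reports zero
// pending entries instead of reading past the end of the vector.
std::size_t pending_in_group(const WarpState& warp, std::size_t i) {
  if (i >= warp.ldgdepbar_buf.size()) return 0;
  return warp.ldgdepbar_buf[i].size();
}

// A DEPBAR may proceed when no recorded group has outstanding entries.
// With an empty buffer the loop body never runs and the result is true.
bool depbar_can_issue(const WarpState& warp) {
  for (std::size_t i = 0; i < warp.ldgdepbar_buf.size(); ++i) {
    if (pending_in_group(warp, i) != 0) return false;
  }
  return true;
}
```

The point of the guard is that `operator[]` on an empty `std::vector` is undefined behavior, which commonly shows up as a segmentation fault; checking the index against `size()` first makes a trace with DEPBAR but no LDGSTS a well-defined case.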

@Wen-Tian-Pineapple
Author

BTW, it would be great to know when this issue is fixed so I can download the newest version of the dev branch.
