-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need for Speed: Make Accel-Sim FAST again #284
Need for Speed: Make Accel-Sim FAST again #284
Comments
See accel-sim/gpgpu-sim_distribution#67 for the fix. |
Hi, I have been studying the fix. I have one suggestion and one question. Let's start with the suggestion. Regarding the cache stats, the fix only works in case the execution is not using Accel-Wattch. However, many of us need that component. I propose to change those if that you have added to be: Moving to the question. I have tested the fix of the ICNT, and provides me different results. For example, a testing kernel that I have used to study the fix requires 2 cycles more to finish the execution. I wonder if this is expected or even if it is a better ICNT model. So, I would like to know if this matters or not. PD: Thank you for the fix, I really appreciated. |
Thanks! Your solution regarding cache stats is better. I'll include that and keep that when we are adding the global cache stats. Regarding ICNT, I expected it to be the same. I only intended to reduce the loop count. Do you mean it takes 2 cycles more in total to finish the execution? Or does each inst take 2 cycles more? A little side talk: are you the author of this paper https://arxiv.org/pdf/2401.10082.pdf? We really like this work and it would be great if we could merge this. Is this something you may be interested in? |
Actually, I think I accidentally fixed a bug. lol. Someone correct me if I am wrong in the original code:
But imagine an extreme case: input buffer 0 has 10 packages for output nodes 0 to 10. So: Then all 10 requests can be served within one cycle, which is wrong in design iSLIP with a single iteration. Each input should only be able to send 1 request each cycle. Now in my design each output node can only accept one package per cycle. Some background reading http://nms.lcs.mit.edu/6829-papers/islip-ton.pdf: Again, I may be wrong. Inputs are highly welcomed. |
It is reporting 2 cycles more of the stat
I'm not expert in ICNT, but I agree that the algorithm is saying that only accepts one input grant each cycle. However, I'm not sure if I am seeing where that is prevented compared to the original code. I suppose that the prevention comes from the original body loop
to your new loop that goes over the set, but I'm not identifying where. I guess that is something related with
Yes, I'm one of the authors. I will be interested in merging that, but not right now. That arxiv PDF was a work in progress that we presented in CAMS 2023 and it is unfinished. We can discuss merging it when it is finish. |
It's prevented when building the set. https://github.com/accel-sim/gpgpu-sim_distribution/pull/67/files#diff-87ad483e1fe95a02fd73cf3251edacacbd39639eafa43b93150042ca68a0b511R190
So I'm only checking the head of the input buffer. And both sets are unique. Then later in the nested loop, each input node and output node can only be iterated once. This prevented the bug. Of course it can still be improved. The temp_node is redundant. But functional-wise this is the same. |
Ok, understood. I agree with you, now it is more correct. By the way, I think these two modifications can also be applied to Vulkan-sim. Which I think is even slower than Accel-sim. I suppose this will have a good impact in that simulator too. |
Good idea! |
Hi, I have tested your new version of the code, and produces different results compared to the other two versions. I mean, original has IPC A, intermediate IPC B, and the current version has IPC C. I don't know if you were awere of the changes in IPC between intermediate code that we were discussing some days ago and the current state. |
Thank you for pointing that out. Yes, it should have some minor effects. The intermediate version was half-fixed. In the intermediate version, an input node can still transfer multiple packages in some corner cases. However, I did not have time to thoroughly test everything (correlate every benchmark in the original paper). Our cluster is scheduled for upgrade. But again, thanks for the input. I'll provide a full correlation once this change is ready. |
Ok, I understand. Regarding the IPC performance between versions, I have the following comparison with Backprop of Rodinia 2: Intermediate: Final: Until the first kernel: In the end: In my personal opinion, I won't care if the IPC is greater or smaller. It will change depending on the kernel. Moreover, the change is very little. I care more about the fact that now the code is doing what is expected. Thanks for fixing that. |
Timing profile from linux perf:
cache_stats
are called per-cycle by Accel-Wattch.O(n^2)
which n is number of nodes. This is crazy.These two functions add up to 2/3 of the simulation time.
28+21+17 = 66%
.Fix:
dev-stream-stats
branch. But needs some final polishing before being merged. For now, disable it if Accel-Wattch is not in useThe text was updated successfully, but these errors were encountered: