
Need for Speed: Make Accel-Sim FAST again #284

Open
JRPan opened this issue Feb 21, 2024 · 11 comments · May be fixed by accel-sim/gpgpu-sim_distribution#67

JRPan (Collaborator) commented Feb 21, 2024


Timing profile from linux perf:

  1. cache_stats is collected per cycle by Accel-Wattch.
  2. The xbar advance iterates over every input->output pair, which is O(n^2), where n is the number of nodes. This is crazy.

Together, these two functions account for about two thirds of the simulation time: 28% + 21% + 17% = 66%.

Fix:

  1. Include a global cache stat, so you don't need to check each SM every cycle (see the sketch after this list). This is already done in the dev-stream-stats branch but needs some final polishing before being merged. For now, disable the per-cycle collection when Accel-Wattch is not in use.
  2. Only check the nodes that have an input/output pending. In the worst case every node has a package and the function still has to iterate through all nodes, but I doubt that happens often. The worst case is the same, but the average should be much better.
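For fix 1, a minimal C++ sketch of the idea (hypothetical names; the dev-stream-stats branch may structure this differently):

#include <cstdint>

// One global accumulator that each cache updates at access time, so a
// per-cycle reader such as Accel-Wattch gets an O(1) snapshot instead of
// summing cache_stats across every SM's caches each cycle.
struct global_cache_stats {
  uint64_t accesses = 0;
  uint64_t misses = 0;
  void on_access(bool miss) {
    ++accesses;
    if (miss) ++misses;
  }
};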
JRPan linked a pull request Feb 21, 2024 that will close this issue
JRPan (Collaborator, Author) commented Feb 21, 2024

See accel-sim/gpgpu-sim_distribution#67 for the fix.

rodhuega (Contributor) commented

Hi, I have been studying the fix. I have one suggestion and one question.

Let's start with the suggestion. Regarding the cache stats, the fix only works when the execution is not using Accel-Wattch. However, many of us need that component. I propose changing the ifs that you added to if (m_config.g_power_simulation_enabled && (((gpu_tot_sim_cycle + gpu_sim_cycle) + 1) % m_config.gpu_stat_sample_freq == 0)) instead of if (m_config.g_power_simulation_enabled). That way, the fix also works when Accel-Wattch is enabled. Notice the need for the +1: the mcpat_cycle function is triggered after gpu_sim_cycle++. I have checked the energy report, and it produces the same outputs.
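For reference, the proposed condition as a standalone predicate (a sketch only; the real change is an if in the simulator's cycle path, and the parameter names mirror the fields used above):

// True only on Accel-Wattch sample cycles; the +1 compensates for
// mcpat_cycle being triggered after gpu_sim_cycle++.
bool should_collect_cache_stats(bool g_power_simulation_enabled,
                                unsigned long long gpu_tot_sim_cycle,
                                unsigned long long gpu_sim_cycle,
                                unsigned gpu_stat_sample_freq) {
  return g_power_simulation_enabled &&
         ((gpu_tot_sim_cycle + gpu_sim_cycle + 1) % gpu_stat_sample_freq == 0);
}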

Moving on to the question. I have tested the ICNT fix, and it gives me different results. For example, a test kernel I used to study the fix requires 2 more cycles to finish execution. I wonder whether this is expected, or even whether it is a better ICNT model, so I would like to know if this matters or not.

PS: Thank you for the fix, I really appreciate it.

JRPan (Collaborator, Author) commented Feb 23, 2024

Thanks! Your solution regarding the cache stats is better. I'll include it, and keep it when we add the global cache stats.

Regarding ICNT, I expected it to be the same; I only intended to reduce the loop count. Do you mean the execution takes 2 more cycles in total to finish? Or does each instruction take 2 more cycles?
Anyway, I have to test it more thoroughly. That's why I marked the PR as a draft and want input from the community.
Thanks again, I will test it more.

A little side talk: are you the author of this paper https://arxiv.org/pdf/2401.10082.pdf? We really like this work and it would be great if we could merge it. Is this something you might be interested in?

JRPan (Collaborator, Author) commented Feb 23, 2024

Actually, I think I accidentally fixed a bug. lol. Someone correct me if I am wrong.

In the original code:

for each output node:
    for each input node:
        if input_buffer[0] can be transferred from input node to output node:
            remove package from input buffer and insert into output buffer

But imagine an extreme case: input buffer 0 has 10 packages for output nodes 0 to 9, so:
input_buffer[] = 0, 1, 2, ..., 9

Then all 10 requests can be served within one cycle, which is wrong by design in iSLIP with a single iteration: each input should only be able to send one request per cycle.

Now, in my design, each output node can only accept one package per cycle.
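A tiny self-contained repro of the effect (hypothetical code, not the actual router): with the output-major nested loop above, input 0's whole buffer drains in a single cycle.

#include <cstdio>
#include <deque>
#include <vector>

int main() {
  const unsigned total_nodes = 10;
  // input 0 holds one package destined for each output node 0..9
  std::vector<std::deque<unsigned>> in_buffer(total_nodes);
  for (unsigned d = 0; d < total_nodes; ++d) in_buffer[0].push_back(d);

  // shape of the original loop: for each output, scan all inputs
  unsigned grants = 0;
  for (unsigned out = 0; out < total_nodes; ++out) {
    for (unsigned in = 0; in < total_nodes; ++in) {
      if (!in_buffer[in].empty() && in_buffer[in].front() == out) {
        in_buffer[in].pop_front();  // "transfer" the head package
        ++grants;
        break;  // one grant per output...
      }
    }
  }
  // ...but input 0 was granted 10 times in one cycle
  printf("grants served for input 0 in one cycle: %u\n", grants);  // prints 10
}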

Some background reading: http://nms.lcs.mit.edu/6829-papers/islip-ton.pdf

Again, I may be wrong. Input is highly welcome.

rodhuega (Contributor) commented

> Regarding ICNT, I expected it to be the same; I only intended to reduce the loop count. Do you mean the execution takes 2 more cycles in total to finish? Or does each instruction take 2 more cycles? Anyway, I have to test it more thoroughly. That's why I marked the PR as a draft and want input from the community. Thanks again, I will test it more.

It reports 2 more cycles in the gpu_sim_cycle stat for the first kernel of Backprop from Rodinia 2. As an effect of that, many other stats change too. This was in my version of the simulator, but I suppose it will be similar in the current dev version, as I have never changed the ICNT between SMs and memory partitions.

> Actually, I think I accidentally fixed a bug. lol. Someone correct me if I am wrong.
>
> In the original code:
>
> for each output node:
>     for each input node:
>         if input_buffer[0] can be transferred from input node to output node:
>             remove package from input buffer and insert into output buffer
>
> But imagine an extreme case: input buffer 0 has 10 packages for output nodes 0 to 9, so: input_buffer[] = 0, 1, 2, ..., 9
>
> Then all 10 requests can be served within one cycle, which is wrong by design in iSLIP with a single iteration: each input should only be able to send one request per cycle.
>
> Now, in my design, each output node can only accept one package per cycle.
>
> Some background reading: http://nms.lcs.mit.edu/6829-papers/islip-ton.pdf
>
> Again, I may be wrong. Input is highly welcome.

I'm no expert in ICNT, but I agree that the algorithm says each input accepts only one grant per cycle. However, I'm not sure I see where that is prevented compared to the original code. I suppose the prevention comes from changing the original body loop

for (unsigned i = 0; i < total_nodes; ++i) {
  if (Has_Buffer_Out(i, 1)) {
    for (unsigned j = 0; j < total_nodes; ++j) {
      ....
      if (grant_cycles_count == 1)
        next_node[i] = (++node_id % total_nodes);

to your new loop that goes over the set, but I can't identify where. I guess it is something related to start_node, but I don't see how it is prevented that all the dests end up with the same next_node[dest] in a given cycle.

> A little side talk: are you the author of this paper https://arxiv.org/pdf/2401.10082.pdf? We really like this work and it would be great if we could merge it. Is this something you might be interested in?

Yes, I'm one of the authors. I would be interested in merging it, but not right now. That arXiv PDF is a work in progress that we presented at CAMS 2023, and it is unfinished. We can discuss merging it when it is finished.

JRPan (Collaborator, Author) commented Feb 26, 2024

It's prevented when building the set: https://github.com/accel-sim/gpgpu-sim_distribution/pull/67/files#diff-87ad483e1fe95a02fd73cf3251edacacbd39639eafa43b93150042ca68a0b511R190
In my code:

for each input_node:
    package = input_node.head()
    if package exists:
        input_set.insert(input_node)
        output_set.insert(package.output_node)

So I'm only checking the head of each input buffer, and both sets are unique. Then, in the later nested loop, each input node and each output node can be iterated only once. This prevents the bug.
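A self-contained C++ sketch of this set-based scheme (hypothetical names; the actual PR keeps per-node round-robin state such as next_node, which is omitted here):

#include <queue>
#include <set>
#include <vector>

struct Package { unsigned dest; };  // output node the package wants

void xbar_advance(std::vector<std::queue<Package>> &in_buffer,
                  std::vector<std::queue<Package>> &out_buffer) {
  // Build both sets from buffer heads only; std::set keeps entries unique.
  std::set<unsigned> input_set, output_set;
  for (unsigned i = 0; i < in_buffer.size(); ++i) {
    if (!in_buffer[i].empty()) {
      input_set.insert(i);
      output_set.insert(in_buffer[i].front().dest);
    }
  }
  // Each output accepts at most one package and each input sends at most
  // its head package, so no node is matched twice in a cycle.
  for (unsigned out : output_set) {
    for (auto it = input_set.begin(); it != input_set.end(); ++it) {
      if (in_buffer[*it].front().dest == out) {
        out_buffer[out].push(in_buffer[*it].front());
        in_buffer[*it].pop();
        input_set.erase(it);  // this input has used its one grant
        break;
      }
    }
  }
}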

Of course, it can still be improved; the temp_node is redundant. But functionally this is the same.

rodhuega (Contributor) commented

Ok, understood. I agree with you; it is more correct now.

By the way, I think these two modifications can also be applied to Vulkan-Sim, which I believe is even slower than Accel-Sim. I suppose they will have a good impact on that simulator too.

JRPan (Collaborator, Author) commented Feb 27, 2024

Good idea!
Thanks.

rodhuega (Contributor) commented Mar 4, 2024

Hi, I have tested your new version of the code, and it produces different results compared to the other two versions. I mean, the original has IPC A, the intermediate version IPC B, and the current version IPC C. I don't know if you were aware of the IPC changes between the intermediate code we were discussing some days ago and the current state.

JRPan (Collaborator, Author) commented Mar 4, 2024

Thank you for pointing that out.

Yes, it should have some minor effects. The intermediate version was only half-fixed.
I would expect original IPC > intermediate IPC > current IPC. Is this correct?

In the intermediate version, an input node could still transfer multiple packages in some corner cases.

However, I have not had time to test everything thoroughly (correlating every benchmark in the original paper), as our cluster is scheduled for an upgrade.
For now, I do not recommend using this for any results that go into a paper.

But again, thanks for the input. I'll provide a full correlation once this change is ready.

rodhuega (Contributor) commented Mar 5, 2024

Ok, I understand.

Regarding the IPC performance between versions, I have the following comparison with Backprop from Rodinia 2:
Initial:
gpu_tot_ipc = 361.4865
gpu_tot_ipc = 291.3595

Intermediate:
gpu_tot_ipc = 361.7782
gpu_tot_ipc = 291.9215

Final:
gpu_tot_ipc = 361.3072
gpu_tot_ipc = 291.6887

Up to the first kernel:
Intermediate > Initial > Final

In the end:
Intermediate > Final > Initial

Personally, I don't care whether the IPC is greater or smaller; it will change depending on the kernel, and the change is very small. I care more about the fact that the code is now doing what is expected.

Thanks for fixing that.

JRPan mentioned this issue Jul 7, 2024
JRPan added this to the v2.0 milestone Jul 7, 2024