VC starvation bug in dragonfly-dally #237

Open
kevinabrown opened this issue Jan 5, 2024 · 1 comment

@kevinabrown
Contributor

The situation:
Some globally routed packets get stuck in the network and stop reaching their destinations when the network is heavily loaded.

Background and reproduction with nearest neighbor traffic

Traffic:
Nearest neighbor uses fixed-pair communication. Terminal X sends to terminal (X+1)%num_terminals.
Three types of traffic:

  • Neighbor on the same router: use compute links only
  • Neighbor in the same group: use compute and local links only
  • Neighbor in a different group: use compute, local, and global links
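
The three traffic types above can be sketched as a small classifier. This is only an illustration, not code from CODES: the `topo_t` struct, its field names, and the assumption that terminals are numbered contiguously by router and then by group are all hypothetical.

```c
#include <assert.h>

/* Hypothetical topology parameters; names are not taken from CODES. */
typedef struct {
    int terminals_per_router;
    int routers_per_group;
    int num_groups;
} topo_t;

typedef enum { SAME_ROUTER, SAME_GROUP, DIFFERENT_GROUP } traffic_kind_t;

int num_terminals(const topo_t *t) {
    return t->terminals_per_router * t->routers_per_group * t->num_groups;
}

/* Nearest neighbor: terminal X sends to terminal (X + 1) % num_terminals. */
int nn_destination(const topo_t *t, int src) {
    assert(src >= 0 && src < num_terminals(t));
    return (src + 1) % num_terminals(t);
}

/* Classify the fixed pair (src, src + 1), assuming contiguous numbering:
 * terminal -> router by integer division, router -> group likewise. */
traffic_kind_t nn_classify(const topo_t *t, int src) {
    int dst = nn_destination(t, src);
    int src_router = src / t->terminals_per_router;
    int dst_router = dst / t->terminals_per_router;
    if (src_router == dst_router) {
        return SAME_ROUTER;       /* compute links only */
    }
    if (src_router / t->routers_per_group == dst_router / t->routers_per_group) {
        return SAME_GROUP;        /* compute and local links */
    }
    return DIFFERENT_GROUP;       /* compute, local, and global links */
}
```

Under this numbering, only the last terminal of each group sends to a different group, so most nearest neighbor traffic stays inside the source group.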

VC usage:
Each router port has 4 VCs. Packets start in VC0 when they enter their source router. At each subsequent router, the deadlock avoidance algorithm in dragonfly-dally chooses the VC for an incoming packet based on how far the packet is along its path. For example, when a packet leaves its source group, it is placed in the next higher-numbered VC: if it was in VC0 in the source group, it will be in VC1 in the next group. There are more cases, but a packet never moves to a lower-numbered VC anywhere along its path.
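
A minimal sketch of the one VC rule described above, assuming the only event that matters is crossing into a new group. The function name `next_vc` and its parameters are hypothetical, and the additional cases handled by the real algorithm are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_VCS 4  /* each router port has 4 VCs, as stated above */

/* Simplified VC selection: bump the VC by one when the packet has just
 * crossed a global link into a new group, otherwise keep the same VC.
 * The result is never lower than current_vc. */
int next_vc(int current_vc, bool crossed_global_link) {
    assert(current_vc >= 0 && current_vc < NUM_VCS);
    int vc = crossed_global_link ? current_vc + 1 : current_vc;
    assert(vc < NUM_VCS);  /* this sketch treats running out of VCs as a bug */
    return vc;
}
```

A packet that starts in VC0 and is routed to another group therefore arrives there in VC1, which is why globally routed packets end up in the higher-numbered VCs.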

The cause:
When a port is sending packets, the VC arbitration algorithm chooses which VC to send from. In CODES, it loops over all VCs, checks whether each one has packets waiting, and breaks out of the loop to send the first packet it finds. The loop always starts from 0, so it always checks VC0 first and finds a packet there, since the network is loaded with packets going to neighbors in the source group. Higher-numbered VCs are starved, and their packets, which have been routed globally, are never delivered. This algorithm can be described as priority-first: lower-numbered VCs have higher priority.
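
The starvation can be reproduced with a toy model. The sketch below is not the CODES source: it represents each VC as a simple packet count, and the function names and the "one new local packet per send slot" load model are assumptions made for illustration.

```c
#include <assert.h>

#define NUM_VCS 4

/* Priority-first arbitration: always scan from VC0 and return the first
 * VC that has a packet waiting, or -1 if all VCs are empty. */
int pick_vc_priority_first(const int queued[NUM_VCS]) {
    for (int vc = 0; vc < NUM_VCS; vc++) {
        if (queued[vc] > 0) {
            return vc;
        }
    }
    return -1;
}

/* Toy load model: VC0 and VC1 each start with one packet, and one new
 * local packet arrives in VC0 after every send slot. Returns how many
 * packets were sent from VC1 over the given number of slots. */
int simulate_priority_first(int steps) {
    assert(steps >= 0);
    int queued[NUM_VCS] = {1, 1, 0, 0};
    int sent_from_vc1 = 0;
    for (int i = 0; i < steps; i++) {
        int vc = pick_vc_priority_first(queued);
        if (vc < 0) {
            break;
        }
        queued[vc]--;
        if (vc == 1) {
            sent_from_vc1++;
        }
        queued[0]++;  /* local traffic keeps refilling VC0 */
    }
    return sent_from_vc1;
}
```

In this model VC0 is never empty when the port arbitrates, so the packet waiting in VC1 is never sent, no matter how many slots are simulated.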

There are some other bugs and design issues that compounded this problem, but they are not relevant to the fix or our study so I won't discuss them here.

A fix
We can change the VC arbitration algorithm from priority-first to round-robin. That is, if the port last sent a packet from VC0, the next search starts from VC1, and the port keeps cycling over the VCs in a round-robin manner.
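
A minimal sketch of round-robin arbitration in the same toy model (per-VC packet counts). It is not the code from the actual fix: the name `pick_vc_round_robin` and the `last_vc` parameter, which the caller is assumed to store per port, are hypothetical.

```c
#include <assert.h>

#define NUM_VCS 4

/* Round-robin arbitration: start scanning one past the VC that was
 * served last, wrap around, and return the first VC with a packet
 * waiting, or -1 if all VCs are empty. Pass last_vc = -1 before the
 * first send so the scan starts at VC0. */
int pick_vc_round_robin(const int queued[NUM_VCS], int last_vc) {
    assert(last_vc >= -1 && last_vc < NUM_VCS);
    for (int i = 1; i <= NUM_VCS; i++) {
        int vc = (last_vc + i) % NUM_VCS;
        if (queued[vc] > 0) {
            return vc;
        }
    }
    return -1;
}
```

The VC that was served last is checked last on the next pass, so a constantly refilled VC0 can no longer block a packet waiting in VC1.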

@kevinabrown self-assigned this Jan 5, 2024
@kevinabrown added the bug (Something isn't working) label Jan 5, 2024
@helq
Member

helq commented Jan 12, 2024

A partial fix (by @kevinabrown, following the strategy proposed above) can be found in commits 8e0f450 and 98aba5e.
