The situation:
Some globally routed packets get stuck in the network and stop reaching their destinations when the network is heavily loaded.
Background and reproduction with nearest neighbor traffic
Traffic:
Nearest neighbor uses fixed-pair communication. Terminal X sends to terminal (X+1)%num_terminals.
Three types of traffic:
Neighbor on the same router: use compute links only
Neighbor in the same group: use compute and local links only
Neighbor in a different group: use compute, local, and global links
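The three traffic types can be sketched as a small classifier. This is only an illustration: the function names and the `terms_per_router` and `routers_per_group` parameters are hypothetical, and it assumes terminals are numbered consecutively within a router and routers consecutively within a group, which may not match the actual CODES configuration.

```c
/* Hypothetical names and layout; not taken from CODES. */
enum traffic_type { SAME_ROUTER, SAME_GROUP, DIFFERENT_GROUP };

/* Nearest-neighbor destination: terminal X sends to (X + 1) % num_terminals. */
int nn_destination(int src, int num_terminals) {
    return (src + 1) % num_terminals;
}

/* Classify a source/destination pair, assuming consecutive numbering of
 * terminals within a router and of routers within a group. */
enum traffic_type classify(int src, int dst,
                           int terms_per_router, int routers_per_group) {
    int src_router = src / terms_per_router;
    int dst_router = dst / terms_per_router;
    if (src_router == dst_router)
        return SAME_ROUTER;      /* compute links only */
    if (src_router / routers_per_group == dst_router / routers_per_group)
        return SAME_GROUP;       /* compute and local links */
    return DIFFERENT_GROUP;      /* compute, local, and global links */
}
```

Under these assumptions, most terminals fall into the first two categories, and only the last terminal of each group sends across a global link.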
VC usage:
Each router port has 4 VCs. Packets start in VC0 when they enter their source router. At each subsequent router, the deadlock avoidance algo in dragonfly dally chooses the VC for an incoming packet based on how far the packet is along its route. For example, when a packet leaves its source group, it is placed in the next higher-numbered VC: if it was in VC0 in the source group, it will be in VC1 in the next group. There are some more cases, but a packet never moves to a lower-numbered VC anywhere along its path.
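A minimal sketch of the one case described above, where a packet that has just left its source group moves up one VC. The function name and its flag argument are hypothetical, and the other cases in the dragonfly dally model are deliberately not covered.

```c
/* Hypothetical helper; covers only the single case described in the text.
 * left_source_group is nonzero when this hop took the packet out of its
 * source group. The result is never lower than current_vc. */
int next_vc(int current_vc, int left_source_group) {
    return left_source_group ? current_vc + 1 : current_vc;
}
```

The point of the sketch is the monotonic property: the VC number either stays the same or increases, so globally routed packets end up in higher-numbered VCs than purely local ones.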
The cause:
When a port is sending packets, the VC arbitration algo chooses which VC to send from. In CODES, the algo loops over all VCs, checks whether each one has any packets waiting, and breaks out of the loop to send the first packet it finds. The loop always starts from VC0, so it always checks VC0 first and finds a packet there, since the network is loaded with packets going to neighbors in the source group. Higher-numbered VCs are starved, and their packets, which have been routed globally, are never delivered. This algo can be described as priority-first, where lower-numbered VCs have higher priority.
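The starvation can be reproduced with a small stand-alone model. This is not the CODES source: the function names are hypothetical, each VC is reduced to a packet count, and heavy local load is modeled as one new VC0 packet arriving before every send.

```c
#define NUM_VCS 4

/* Priority-first: scan from VC0 and return the first VC with a queued
 * packet, or -1 if every VC is empty. */
int pick_vc_priority_first(const int queued[NUM_VCS]) {
    for (int vc = 0; vc < NUM_VCS; vc++) {
        if (queued[vc] > 0)
            return vc;
    }
    return -1;
}

/* Model a loaded port for `steps` send opportunities. One local packet
 * arrives in VC0 before each send (an assumed load pattern), and
 * sent[vc] counts the packets sent from each VC. */
void run_priority_first(int queued[NUM_VCS], int sent[NUM_VCS], int steps) {
    for (int i = 0; i < steps; i++) {
        queued[0]++;                         /* constant local traffic */
        int vc = pick_vc_priority_first(queued);
        if (vc < 0)
            continue;
        queued[vc]--;
        sent[vc]++;
    }
}
```

Starting with packets waiting in VC1 and running any number of steps, every send comes from VC0 and the VC1 count never changes.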
There are some other bugs and design issues that compounded this problem, but they are not relevant to the fix or our study so I won't discuss them here.
A fix
We can change the VC arbitration algo from priority-first to round-robin. That is, if the port last sent a packet from VC0, it starts its next check from VC1 and continues looping over the VCs in round-robin order.
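A minimal sketch of round-robin arbitration under the same simplified model (each VC reduced to a packet count, one new VC0 packet arriving before every send). The function names and the `last_vc` bookkeeping are hypothetical and are not taken from the actual CODES change.

```c
#define NUM_VCS 4

/* Round-robin: start checking at the VC after the one served last and
 * wrap around. Pass last_vc = -1 before anything has been sent.
 * Returns the chosen VC, or -1 if every VC is empty. */
int pick_vc_round_robin(const int queued[NUM_VCS], int last_vc) {
    for (int i = 1; i <= NUM_VCS; i++) {
        int vc = (last_vc + i) % NUM_VCS;
        if (queued[vc] > 0)
            return vc;
    }
    return -1;
}

/* Same assumed load pattern as a heavily loaded port: one local packet
 * arrives in VC0 before each send. sent[vc] counts packets per VC. */
void run_round_robin(int queued[NUM_VCS], int sent[NUM_VCS], int steps) {
    int last_vc = -1;
    for (int i = 0; i < steps; i++) {
        queued[0]++;
        int vc = pick_vc_round_robin(queued, last_vc);
        if (vc < 0)
            continue;
        queued[vc]--;
        sent[vc]++;
        last_vc = vc;
    }
}
```

With this policy the port alternates between VC0 and any higher-numbered VC that has packets waiting, so the globally routed packets drain even while VC0 stays busy.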