The simulator is event-driven (single-threaded) and operates at the level of a large datacenter network topology. It takes the following inputs, which are specified through a JSON input file:
- Topology: the topology type (e.g. FB Fabric) along with its parameters.
- Link failure trace: a trace of link failure events, where each event is denoted as `<time>, <link id>, <loss rate>`. For example, `349200,6136,6.5e-05` denotes that at time 349200, link ID 6136 in the network topology started corrupting packets with a loss rate of 6.5e-05.
- Solution: the solution could be either CorrOpt, an algorithm to disable a subset of the failed links, or the joint strategy of LinkGuardian + CorrOpt as proposed in our paper (section 3.6). Any parameters corresponding to the solution are also required as input; most importantly, the "capacity constraint" under which the solution must operate.
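
As an illustration of the trace format described above, the following is a minimal sketch of how one trace line could be parsed into a typed record. The names `LinkFailureEvent`, `parse_trace_line`, and `parse_trace` are hypothetical and not taken from the simulator's code. The sketch also assumes that the time and link ID fields are integers, as in the example line.

```python
from typing import List, NamedTuple


class LinkFailureEvent(NamedTuple):
    """One link failure event (hypothetical record type, for illustration)."""
    time: int         # assumed integer timestamp; units are not stated here
    link_id: int      # ID of the link in the network topology
    loss_rate: float  # packet corruption loss rate of the link


def parse_trace_line(line: str) -> LinkFailureEvent:
    """Parse a single '<time>,<link id>,<loss rate>' line."""
    time_str, link_str, loss_str = (field.strip() for field in line.split(","))
    return LinkFailureEvent(int(time_str), int(link_str), float(loss_str))


def parse_trace(text: str) -> List[LinkFailureEvent]:
    """Parse a whole trace, skipping blank lines, ordered by event time."""
    events = [parse_trace_line(l) for l in text.splitlines() if l.strip()]
    events.sort(key=lambda e: e.time)
    return events


event = parse_trace_line("349200,6136,6.5e-05")
print(event.time, event.link_id, event.loss_rate)
```

Sorting by time is a convenience for an event-driven loop, which would typically process events in timestamp order; whether the actual trace is already sorted is not assumed here.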
The simulator then outputs a timeseries of several topology-level performance metrics, the most important of which are the following:
- Total penalty: sum of the loss rates for all the active (remaining) corrupting links in the network.
- Least paths per ToR: the least fraction of paths to the spine (top) layer of the network for the worst-case ToR. This metric captures the impact on per-ToR path diversity as corrupting links are disabled for repair.
- Least capacity per pod: the total capacity in a network pod from the ToR-layer to the spine (top) layer for the worst-case pod in the network.
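
The three metrics above can be sketched as simple reductions over per-link, per-ToR, and per-pod state. This is only an illustrative sketch: the function names and the dictionary-based inputs are assumptions, and the path counts and pod capacities are taken as precomputed, whereas the simulator derives them from the topology.

```python
import math
from typing import Dict, Set


def total_penalty(loss_rates: Dict[int, float], disabled: Set[int]) -> float:
    """Sum of loss rates over corrupting links that are still active."""
    return sum(rate for link, rate in loss_rates.items() if link not in disabled)


def least_paths_per_tor(available: Dict[str, int], total: Dict[str, int]) -> float:
    """Fraction of paths to the spine layer for the worst-case ToR."""
    return min(available[tor] / total[tor] for tor in total)


def least_capacity_per_pod(pod_capacity: Dict[str, float]) -> float:
    """ToR-to-spine capacity of the worst-case pod."""
    return min(pod_capacity.values())


# Hypothetical example state: three corrupting links, one of them disabled.
loss_rates = {6136: 6.5e-05, 17: 1e-03, 42: 2e-04}
disabled = {17}
penalty = total_penalty(loss_rates, disabled)

paths = least_paths_per_tor({"tor0": 3, "tor1": 4}, {"tor0": 4, "tor1": 4})
capacity = least_capacity_per_pod({"pod0": 400.0, "pod1": 320.0})

print(math.isclose(penalty, 2.65e-04), paths, capacity)
```

In this example, disabling link 17 removes its loss rate from the total penalty, while the lost paths on `tor0` lower the least-paths metric; this is the trade-off that the capacity constraint bounds.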