Why is broker resilience crucial in edge federations? If a worker fails, a broker can take over its job (i.e., act as a worker) or allocate the same job to another worker, so fault remediation steps are possible and do not have a very high cost. However, if a broker fails, all tasks sent to that broker fail and its worker nodes cannot be used. This makes broker resilience far more important.
There is a tradeoff in the number of brokers we need in the system. We assume that gateway devices send tasks to the closest broker, breaking ties uniformly at random. If there are too many brokers, we are left with too few workers, which reduces compute capacity and hurts the performance of the system. If there are too few brokers, we have fewer single points of failure, but the brokers can become bottlenecks. So increasing or decreasing the number of brokers helps one concern and hurts the other, and both need to be considered.
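A minimal sketch of this routing assumption follows; the 2-D coordinates, the Euclidean distance and the `assign_broker` name are illustrative choices for this sketch and are not taken from the repository.

```python
import random

def assign_broker(gateway_pos, broker_positions):
    """Nearest-broker routing rule assumed above: a gateway sends its tasks to
    the closest broker, breaking distance ties uniformly at random.
    Positions are hypothetical 2-D coordinates used only for illustration."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    distances = [dist(gateway_pos, b) for b in broker_positions]
    d_min = min(distances)
    candidates = [i for i, d in enumerate(distances) if d == d_min]
    return random.choice(candidates)  # uniform tie-breaking

# Example: a gateway at (0, 0) is equidistant from brokers 0 and 1.
print(assign_broker((0, 0), [(1, 0), (0, 1), (3, 3)]))
```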
Conventional optimization techniques that use a neural network as a surrogate model claim that this approach is better because the search is goal-directed (it follows gradients rather than other search strategies). However, in discrete domains this relies on the assumption that the surrogate surface is smooth, i.e., that the discrete point closest to the continuous optimum is also the optimum in the discrete space. This is not always true and can give rise to non-optimal solutions.
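To make this concrete, here is a toy illustration (the surface `f` is invented for this example and is not from the paper) in which rounding the continuous optimum to the nearest feasible integer does not give the best discrete point.

```python
import numpy as np

# Toy surrogate-like surface over a discrete decision variable x in {0, 1, 2, 3}:
# a narrow deep valley near x = 1.4 and a broader valley near x = 3.
f = lambda x: -2 * np.exp(-8 * (x - 1.4) ** 2) - 1.2 * np.exp(-0.5 * (x - 3) ** 2)

grid = np.linspace(0, 3, 3001)
x_cont = grid[np.argmin(f(grid))]    # (approximate) continuous optimum, ~1.42
x_rounded = int(round(x_cont))       # nearest discrete point: 1
x_discrete = min(range(4), key=f)    # true discrete optimum: 3

print(x_cont, x_rounded, f(x_rounded), x_discrete, f(x_discrete))
# The rounded point (x = 1) is noticeably worse than the true discrete optimum
# (x = 3), illustrating how the smoothness assumption can fail.
```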
Another problem with GOBI is that we have no way to estimate the confidence of the surrogate surface, i.e., of its approximation of the real metrics. This leads us either to perform uncertainty-based optimization (GOSH) or to add other inputs such as system topology to improve performance (HUNTER). A further consequence is that we do not know when to fine-tune the model, so we have to do it at every interval, which might not be the best decision as variable and spiky loads can cause contention on constrained edge nodes. For adaptive systems, a confidence estimate is important to ensure we fine-tune only when needed. Thus, instead of going from the input graph to QoS metrics (Graph -> Metrics), we go from the input graph and metrics to a confidence score (Graph + Metrics -> confidence score). This is similar to a discriminator network that predicts the probability of the data being real, and it only needs normal execution traces. We use a GON (Generative Optimization Network) model here, as we can then train using random samples and ensure that the confidence is lower for unseen settings, which facilitates deciding when to fine-tune the model. We use POT (Peaks-Over-Threshold) to find the confidence threshold below which we fine-tune the model with the latest data on the new graph topology until the confidence score rises back above the threshold.
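A minimal PyTorch sketch of this interface is given below. It is not the CAROL architecture (the real model works on graph-structured inputs); `ConfidenceNet`, its layer sizes and `needs_fine_tuning` are illustrative names, and the POT threshold is assumed to have already been fitted on historical confidence scores.

```python
import torch
import torch.nn as nn

class ConfidenceNet(nn.Module):
    """Illustrative stand-in for the GON-style confidence network:
    (graph features + metrics) -> confidence score in [0, 1]."""
    def __init__(self, graph_dim, metric_dim, hidden=64):
        super().__init__()
        self.metric_dim = metric_dim
        self.net = nn.Sequential(
            nn.Linear(graph_dim + metric_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, graph_feats, metrics):
        return self.net(torch.cat([graph_feats, metrics], dim=-1))

def needs_fine_tuning(model, graph_feats, metrics, pot_threshold):
    """Fine-tune only when the confidence for the current (topology, metrics)
    pair drops below a threshold obtained via Peaks-Over-Threshold (POT) on
    past confidence scores."""
    with torch.no_grad():
        confidence = model(graph_feats, metrics).item()
    return confidence < pot_threshold, confidence
```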
Now that we have a GON model and know when to fine-tune it for a new topology, we can run the fault-tolerance steps. Our GON model takes as input the graph topology and metrics that are initialized randomly. At each step, we start from the previous topology (the starting topology is set by the federation manager) and use second-order gradient optimization to find the metrics that best match the expected behaviour; the converged GON output is the confidence score for that topology. We then run tabu search over the topology, using the neighbours generated by the various node shifts.
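The recovery step can be sketched as follows, reusing the `ConfidenceNet` stand-in from the previous snippet. CAROL uses second-order gradient optimization; plain first-order ascent is shown here for brevity, and the topology objects (with hypothetical `.features` and `.key` attributes) and the `node_shift_neighbours` generator are placeholders for the repository's own data structures.

```python
import torch

def converged_confidence(model, graph_feats, steps=50, lr=0.01):
    """Start from randomly initialised metrics and optimise them by gradient
    ascent on the confidence output; return the converged confidence score
    for the given topology (CAROL itself uses second-order optimisation)."""
    metrics = torch.rand(model.metric_dim, requires_grad=True)
    opt = torch.optim.Adam([metrics], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-model(graph_feats, metrics).sum()).backward()  # maximise confidence
        opt.step()
    with torch.no_grad():
        return model(graph_feats, metrics).item()

def tabu_search(model, start_topology, node_shift_neighbours, iters=20, tabu_len=5):
    """Tabu search over topologies: neighbours come from node shifts, each is
    scored by its converged confidence, and recently visited topologies are
    kept in a short tabu list to avoid cycling."""
    best = current = start_topology
    best_score = converged_confidence(model, current.features)
    tabu = [current.key]
    for _ in range(iters):
        candidates = [t for t in node_shift_neighbours(current) if t.key not in tabu]
        if not candidates:
            break
        score, current = max(((converged_confidence(model, t.features), t) for t in candidates),
                             key=lambda pair: pair[0])
        tabu = (tabu + [current.key])[-tabu_len:]
        if score > best_score:
            best, best_score = current, score
    return best
```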
For the experiments, we need only normal execution traces with diverse topologies to train the model; a fault model is needed only at test time.
Clone repo.
git clone https://github.com/imperial-qore/CAROL.git
cd CAROL/
Install dependencies.
sudo apt -y update
python3 -m pip install --upgrade pip
python3 -m pip install matplotlib scikit-learn
python3 -m pip install -r requirements.txt
python3 -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
export PATH=$PATH:~/.local/bin
Change line 115 in main.py to use one of the implemented fault-tolerance techniques: CAROLRecovery, ECLBRecovery, DYVERSERecovery, ELBSRecovery, LBOSRecovery, FRASRecovery, TopoMADRecovery or StepGANRecovery, and run the code using the following command.
python3 main.py
| Items | Contents |
| --- | --- |
| Pre-print | https://arxiv.org/pdf/2203.07140.pdf |
| Contact | Shreshth Tuli (@shreshthtuli) |
| Funding | Imperial President's scholarship |
Our work has been accepted at the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2022. Cite our work using the BibTeX entry below.
@inproceedings{tuli2022carol,
title={{CAROL: Confidence-Aware Resilience Model for Edge Federations}},
author={Tuli, Shreshth and Casale, Giuliano and Jennings, Nicholas R},
booktitle={IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)},
year={2022},
organization={IEEE}
}
BSD-3-Clause. Copyright (c) 2022, Shreshth Tuli. All rights reserved.
See License file for more details.