Why is broker resilience crucial in edge federations? If a worker fails, a broker can take over its job (i.e., act as a worker) or allocate the same job to another worker, so fault remediation steps are possible and do not have a very high cost. However, if a broker fails, all tasks sent to that broker fail and its worker nodes cannot be used. This makes broker resilience far more important.
There is a tradeoff in the number of brokers we need in the system. We assume that gateway devices send tasks to the closest broker, breaking ties uniformly at random. If there are too many brokers, we are left with too few workers, which reduces compute capacity and hurts the performance of the system. If there are too few brokers, we have fewer single points of failure, but the brokers can become bottlenecks. So increasing or decreasing the number of brokers helps one concern and hurts the other, and both need to be considered.
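A minimal sketch of this routing assumption follows; the 2-D coordinates, the Euclidean distance and the `assign_broker` name are illustrative choices for this sketch and are not taken from the repository.

```python
import random

def assign_broker(gateway_pos, broker_positions):
    """Nearest-broker routing rule assumed above: a gateway sends its tasks to
    the closest broker, breaking distance ties uniformly at random.
    Positions are hypothetical 2-D coordinates used only for illustration."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    distances = [dist(gateway_pos, b) for b in broker_positions]
    d_min = min(distances)
    candidates = [i for i, d in enumerate(distances) if d == d_min]
    return random.choice(candidates)  # uniform tie-breaking

# Example: a gateway at (0, 0) is equidistant from brokers 0 and 1.
print(assign_broker((0, 0), [(1, 0), (0, 1), (3, 3)]))
```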
Conventional optimization techniques that use a neural network as a surrogate model claim that this approach is better because the search is goal-directed (it follows gradients rather than other search strategies). However, in discrete domains this relies on the assumption that the surrogate surface is smooth, i.e., that the discrete point closest to the continuous optimum is also the optimum in the discrete space. This is not always true and can give rise to non-optimal solutions.
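To make this concrete, here is a toy illustration (the surface `f` is invented for this example and is not from the paper) in which rounding the continuous optimum to the nearest feasible integer does not give the best discrete point.

```python
import numpy as np

# Toy surrogate-like surface over a discrete decision variable x in {0, 1, 2, 3}:
# a narrow deep valley near x = 1.4 and a broader valley near x = 3.
f = lambda x: -2 * np.exp(-8 * (x - 1.4) ** 2) - 1.2 * np.exp(-0.5 * (x - 3) ** 2)

grid = np.linspace(0, 3, 3001)
x_cont = grid[np.argmin(f(grid))]    # (approximate) continuous optimum, ~1.42
x_rounded = int(round(x_cont))       # nearest discrete point: 1
x_discrete = min(range(4), key=f)    # true discrete optimum: 3

print(x_cont, x_rounded, f(x_rounded), x_discrete, f(x_discrete))
# The rounded point (x = 1) is noticeably worse than the true discrete optimum
# (x = 3), illustrating how the smoothness assumption can fail.
```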
Another problem with GOBI is that we have no way to estimate the confidence of the surrogate surface, i.e., of its approximation of the real metrics. This leads us either to perform uncertainty-based optimization (GOSH) or to add other inputs such as system topology to improve performance (HUNTER). A further consequence is that we do not know when to fine-tune the model, so we have to do it at every interval, which might not be the best decision as variable and spiky loads can cause contention on constrained edge nodes. For adaptive systems, a confidence estimate is important to ensure we fine-tune only when needed. Thus, instead of going from the input graph to QoS metrics (Graph -> Metrics), we go from the input graph and metrics to a confidence score (Graph + Metrics -> confidence score). This is similar to a discriminator network that predicts the probability of the data being real, and it only needs normal execution traces. We use a GON (Generative Optimization Network) model here, as we can then train using random samples and ensure that the confidence is lower for unseen settings, which facilitates deciding when to fine-tune the model. We use POT (Peaks-Over-Threshold) to find the confidence threshold below which we fine-tune the model with the latest data on the new graph topology until the confidence score rises back above the threshold.
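A minimal PyTorch sketch of this interface is given below. It is not the CAROL architecture (the real model works on graph-structured inputs); `ConfidenceNet`, its layer sizes and `needs_fine_tuning` are illustrative names, and the POT threshold is assumed to have already been fitted on historical confidence scores.

```python
import torch
import torch.nn as nn

class ConfidenceNet(nn.Module):
    """Illustrative stand-in for the GON-style confidence network:
    (graph features + metrics) -> confidence score in [0, 1]."""
    def __init__(self, graph_dim, metric_dim, hidden=64):
        super().__init__()
        self.metric_dim = metric_dim
        self.net = nn.Sequential(
            nn.Linear(graph_dim + metric_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, graph_feats, metrics):
        return self.net(torch.cat([graph_feats, metrics], dim=-1))

def needs_fine_tuning(model, graph_feats, metrics, pot_threshold):
    """Fine-tune only when the confidence for the current (topology, metrics)
    pair drops below a threshold obtained via Peaks-Over-Threshold (POT) on
    past confidence scores."""
    with torch.no_grad():
        confidence = model(graph_feats, metrics).item()
    return confidence < pot_threshold, confidence
```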
Now that we have a GON model and know when to fine-tune it for a new topology, we can run the fault-tolerance steps. Our GON model takes as input the graph topology and metrics that are initialized randomly. At each step, we start from the previous topology (the starting topology is set by the federation manager) and use second-order gradient optimization to find the metrics that best match the expected behaviour; the converged GON output is the confidence score for that topology. We then run tabu search over the topology, using the neighbours generated by the various node shifts.
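The recovery step can be sketched as follows, reusing the `ConfidenceNet` stand-in from the previous snippet. CAROL uses second-order gradient optimization; plain first-order ascent is shown here for brevity, and the topology objects (with hypothetical `.features` and `.key` attributes) and the `node_shift_neighbours` generator are placeholders for the repository's own data structures.

```python
import torch

def converged_confidence(model, graph_feats, steps=50, lr=0.01):
    """Start from randomly initialised metrics and optimise them by gradient
    ascent on the confidence output; return the converged confidence score
    for the given topology (CAROL itself uses second-order optimisation)."""
    metrics = torch.rand(model.metric_dim, requires_grad=True)
    opt = torch.optim.Adam([metrics], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-model(graph_feats, metrics).sum()).backward()  # maximise confidence
        opt.step()
    with torch.no_grad():
        return model(graph_feats, metrics).item()

def tabu_search(model, start_topology, node_shift_neighbours, iters=20, tabu_len=5):
    """Tabu search over topologies: neighbours come from node shifts, each is
    scored by its converged confidence, and recently visited topologies are
    kept in a short tabu list to avoid cycling."""
    best = current = start_topology
    best_score = converged_confidence(model, current.features)
    tabu = [current.key]
    for _ in range(iters):
        candidates = [t for t in node_shift_neighbours(current) if t.key not in tabu]
        if not candidates:
            break
        score, current = max(((converged_confidence(model, t.features), t) for t in candidates),
                             key=lambda pair: pair[0])
        tabu = (tabu + [current.key])[-tabu_len:]
        if score > best_score:
            best, best_score = current, score
    return best
```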
For the experiments, we need only normal execution traces with diverse topologies to train the model; a fault model is needed only at test time.
Clone repo.
git clone https://github.com/imperial-qore/CAROL.git
cd CAROL/
Install dependencies.
sudo apt -y update
python3 -m pip install --upgrade pip
python3 -m pip install matplotlib scikit-learn
python3 -m pip install -r requirements.txt
python3 -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
export PATH=$PATH:~/.local/bin
Change line 115 in main.py to use one of the implemented fault-tolerance techniques: CAROLRecovery, ECLBRecovery, DYVERSERecovery, ELBSRecovery, LBOSRecovery, FRASRecovery, TopoMADRecovery or StepGANRecovery, and run the code using the following command.
python3 main.py
| Items | Contents |
| --- | --- |
| Pre-print | https://arxiv.org/pdf/2203.07140.pdf |
| Contact | Shreshth Tuli (@shreshthtuli) |
| Funding | Imperial President's scholarship |
Our work has been accepted at the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2022. Cite our work using the BibTeX entry below.
@inproceedings{tuli2022carol,
title={{CAROL: Confidence-Aware Resilience Model for Edge Federations}},
author={Tuli, Shreshth and Casale, Giuliano and Jennings, Nicholas R},
booktitle={IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)},
year={2022},
organization={IEEE}
}
BSD-3-Clause. Copyright (c) 2022, Shreshth Tuli. All rights reserved.
See License file for more details.