The repo contains a demo of Federated Learning app deployed on a multi-cluster environment with Liqo and the popular FL framework Flower.
The deployed app is a simple ML model trained on the CIFAR-10 dataset using a simple CNN scratched with PyTorch.
The demo leverages a multi-cluster environment to run a distributed FL training. To setup the environment we need a:
- a central cluster:
- acts as a server
- pilots the application (offloads client pods to the leaf clusters)
- expose a private Service (ClusterIP) for the client to connects
- aggregrate the results and hosts the global model
- N leaf clusters:
- act as clients
- train their local model using local (sensitive) data
- share the updated weights to server by accessing the server Service
Architecture overview:
docker build -f ./build/Dockerfile.superexec -t <IMAGE_NAME>:<IMAGE_VERSION> ./demo
docker build -f ./build/Dockerfile.clientapp -t <IMAGE_NAME>:<IMAGE_VERSION> ./demo
To bootstrap the environment you need 1 cluster acting as a server, and N acting as clients.
Requirements:
- install Liqo on all clusters
- enable the liqo
RuntimeClass
feature. This is not strictly required, you can keep the RuntimeClass off, but you need to modify the chart adding NodeAffinities to all deployments/statefulsets (i.e., deployserver
on local nodes, whileclients
on liqo virtual nodes) - peer the central cluster (server) with all N client clusters. The central cluster acts as a consumer, while the leaf clusters acts as a provider (no bidirectional peering is required). Refer to the official docs for more info.
- install flwr cli
On the central cluster (the one acting as a server) run:
kubectl create ns flower-demo
liqoctl offload namespace flower-demo --namespace-mapping-strategy EnforceSameName
kubectl apply -f ./deploy/manifests -n flower-demo
Note: in the manifests/clients.yaml
file make sure the number of replicas of the StatefulSet and the NUM_CLIENT
env variable are equal to the number of clients (peered clusters).
On the central cluster, expose the server app endpoint (port 9093):
kubectl port-forward pods/<SERVER_POD> 9093:9093
Now you are ready to run the training with:
flwr run ./demo liqo