-
I tried to deploy vHive on Google Compute Engine instead of bare metal but ran into some errors which are most likely network related. The setup included one master node and two worker nodes, but I also ran into the same issues when using only one worker node. The setup consists of 3 Ubuntu 18.04 VMs with nested virtualisation enabled.

Logs

worker0_containerd.log

Setup

For completeness, here are the instructions I used to set up the VMs.

1. Image creation

Create an image for nested virtualisation:

gcloud compute disks create nested-kvm-ub18-disk --image-project ubuntu-os-cloud --image-family ubuntu-1804-lts

gcloud compute images create nested-kvm-ub18-image \
--source-disk nested-kvm-ub18-disk --source-disk-zone $(gcloud config get-value compute/zone) \
--licenses "https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"

2. Network setup

2.1 Virtual Private Cloud Network

Create the VPC network:

gcloud compute networks create vhive-vpc --subnet-mode custom

Provision a subnet with a large enough IP range to fit all nodes in the cluster:

gcloud compute networks subnets create kubernetes-nodes \
--network vhive-vpc \
--range 10.240.0.0/24

2.2 Firewall rules

Create a firewall rule that allows internal communication across TCP, UDP, ICMP and IP in IP (used for the Calico overlay):

gcloud compute firewall-rules create vhive-vpc-allow-internal \
--allow tcp,udp,icmp,ipip \
--network vhive-vpc \
--source-ranges 10.240.0.0/24

Create a firewall rule that allows external SSH, ICMP, HTTPS (the Kubernetes API server on port 6443) and VXLAN:

gcloud compute firewall-rules create vhive-vpc-allow-external \
--allow tcp:22,tcp:6443,icmp,udp:8472,udp:4789 \
--network vhive-vpc \
--source-ranges 0.0.0.0/0

3. Compute Instances

3.1 Master node

gcloud compute instances create controller \
--async \
--boot-disk-size 50GB \
--boot-disk-type pd-ssd \
--can-ip-forward \
--image nested-kvm-ub18-image \
--machine-type n1-standard-2 \
--private-network-ip 10.240.0.11 \
--scopes compute-rw,storage-ro,service-management,service-control,logging-write,monitoring \
--subnet kubernetes-nodes \
--min-cpu-platform "Intel Haswell" \
--tags vhive,controller

3.2 Worker nodes

for i in 0 1; do
gcloud compute instances create worker-${i} \
--async \
--boot-disk-size 50GB \
--boot-disk-type pd-ssd \
--can-ip-forward \
--image nested-kvm-ub18-image \
--machine-type n1-standard-2 \
--private-network-ip 10.240.0.2${i} \
--scopes compute-rw,storage-ro,service-management,service-control,logging-write,monitoring \
--subnet kubernetes-nodes \
--min-cpu-platform "Intel Haswell" \
--tags vhive,worker
done

4. Node configuration

4.1 VMX setup

SSH into each node and check that virtualization is enabled by running the following command. A nonzero response confirms that nested virtualization is enabled.

grep -cw vmx /proc/cpuinfo
Then enable KVM by running the following command. This should return "OK" if KVM was enabled successfully.
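One way to do this on Ubuntu 18.04 (an assumption about the command used here, not necessarily the original one) is to install qemu-kvm together with cpu-checker and verify with kvm-ok:

sudo apt-get update && sudo apt-get install -y qemu-kvm cpu-checker
# prints "INFO: /dev/kvm exists" and "KVM acceleration can be used" when KVM is usable
sudo kvm-ok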
4.2 vHive setup

vHive can now be set up using the vHive quick start guide.
Replies: 22 comments
-
@amohoste thanks for reporting this issue. From the logs, I can see only a failure to pull the Docker image for the rnn_serving function. This may happen when storage is somewhat slow because that image is rather big. We have previously tested running vHive in a single VM (but not in a public cloud), so it should work. I suggest you start by deploying a single-node vHive cluster in a VM. Here are the instructions:
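Roughly, and assuming the script names from the vHive quick-start guide (treat this as an illustrative sketch rather than the exact instructions):

# sketch only; script names assumed from the quick-start guide
git clone https://github.com/ease-lab/vhive.git && cd vhive
./scripts/cloudlab/setup_node.sh
# ...start containerd, firecracker-containerd and the vhive daemon as described in the guide...
./scripts/cluster/create_one_node_cluster.sh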
If this setup works, we can continue troubleshooting with the multi-node setting. If something doesn't work, please also provide the output of the commands above (along with the logs).
-
I tried running only the hello world example as instructed. On the first invocation, everything seems to work fine and the CSV file gets populated with the latencies:
However, on the second invocation none of the requests complete anymore, resulting in the CSV file being empty.
-
@amohoste I think that the issue you observe is not really a problem. The invoker runs for only 5 seconds by default, so the instances probably fail to reply by the time the invoker finishes. This is because, by default, instances are booted from scratch (not from a snapshot), which takes roughly 3 seconds on average. The deployer client also deploys functions with autoscaling enabled by default, meaning that by the time you invoke the function for the second time, all instances may already be down and need to be booted from scratch. I recommend running the invoker for a longer time (e.g., 20 seconds) by setting the corresponding runtime argument.
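For example (the invoker path and flag name here are assumptions; check the invoker's help output for the exact argument):

# run the invoker for 20 seconds instead of the default 5
go run ./examples/invoker -time 20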
-
@amohoste Please let us know if the multi-node setup works.
-
Setting a longer invoker time indeed seems to resolve the issue for a single node. Thanks for the help. For the multi-node setup, everything seems to work fine when only running the hello world example:
However, when running the default functions.json, which also contains rnn-serving and pyaes, I encountered some errors. There seem to be some failures related to pulling the function images:
master-ctrd.log
-
@amohoste it seems that our timeouts are too short. Could you please specify the bandwidth of the network and the type of storage mounted on the VMs? Could you please also specify the target deployment for your needs and, if you can, the target workloads?
-
I was using two n1-standard-4 (4 vCPUs, 15 GB memory) nodes with SSD persistent disk storage for the measurements in my previous post. According to the documentation, the maximum egress bandwidth for these machines is 10 Gbps. I also ran some iperf measurements over TCP for 60 seconds between the two nodes. For local traffic I measured 7.77 Gbps with one thread and 9.72 Gbps using 4 threads. For external traffic I obtained 4.82 Gbps with one thread and 6.67 Gbps using 4 threads. I am considering using vHive to experiment with implementing and evaluating different serverless cloud function scheduling strategies as part of my master's thesis. Consequently, I will mainly be running benchmarks that should be representative of real serverless workloads. As far as the target deployment is concerned, I am planning to use a Google Cloud setup with a good number of nodes for the benchmarks.
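For reference, such a measurement can be reproduced with something like the following (the worker IP is taken from the setup above; these are the standard iperf flags, so the exact invocation I used may differ slightly):

# on worker-0 (10.240.0.20): start the iperf server
iperf -s

# on worker-1: 60-second TCP tests with 1 and 4 parallel streams
iperf -c 10.240.0.20 -t 60
iperf -c 10.240.0.20 -t 60 -P 4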
-
Ok, thanks for the details. I think that for the setup you described, pulling function images from Docker Hub at each worker node becomes a clear bottleneck, resulting in timeouts. What would make sense is to pre-pull the Docker images of the functions and stateful services that you intend to use into a registry local to the k8s cluster. This is not vHive-specific; any open-source FaaS would have the same problem. Would you be willing to contribute this feature if we provide all the necessary guidance? Basically, you would need to deploy a docker-registry service, then we can add a runtime argument for using it. Please let me know what you think.
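As a rough sketch of the idea (not the final design; the registry placement, port and image name below are assumptions):

# run a plain Docker registry on the controller node (10.240.0.11)
docker run -d -p 5000:5000 --restart=always --name registry registry:2

# pre-pull a function image, retag it against the local registry and push it
# (the image name is an assumption; workers also need the registry listed under
# insecure-registries unless TLS is configured)
docker pull docker.io/vhiveease/rnn_serving:var_workload
docker tag docker.io/vhiveease/rnn_serving:var_workload 10.240.0.11:5000/rnn_serving:var_workload
docker push 10.240.0.11:5000/rnn_serving:var_workload

The functions would then be deployed with image URLs pointing at the local registry instead of Docker Hub, which is where the runtime argument would come in.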
-
Sure! I could look into that. I am currently evaluating the suitability of vHive versus other open-source platforms for my use case, but I'm leaning towards using vHive. Maybe we could discuss this, along with the feature you described above, through Slack or some other platform?
-
@amohoste sure, please message me on Slack and let's discuss.
-
Closing, as the infrastructure works on GCP.
-
I am again running into issues where services fail to deploy. I think it might be related to the following error:
worker-0_fc-ctrd.log
-
@amohoste no, the CNI error message is not a real issue, just an artifact of how we start a cluster. As of now, deployment is not a critical issue because this is not what vHive users benchmark (at least not yet). For now, re-deploying should suffice. You may want to raise a separate issue for this problem and attach the logs; however, it is not going to be fixed soon because it is of rather low priority and we do not see an easy fix at the moment. Finally, in the future, please re-open an issue if you add a comment, because otherwise it may fall through the cracks.
-
There was a problem when upgrading the dependencies' binaries that was not caught by the CI. It's already fixed. Please start with a clean setup.
-
Also, we decided to re-iterate on the existing sporadic failures asap. Hopefully, we'll fix this sporadic failure too.
-
@ustiugov Thanks for the heads up. Unfortunately, even now it still seems to be failing. Redeploying multiple times didn't help on the multi-node setup. I also tried the one-node setup with the fixed binaries, but I am running into the same issue. If everything is working on your end, I can only imagine this being something related to Google Cloud. I am not able to re-open the issue, however.
controller_ctrd.log
-
@amohoste interesting, is your repo up to date with the latest commits from yesterday? We are going to check asap.
-
Correct. I'll try testing it with a commit from 15 days ago to double-check whether this is due to Google Cloud or a newly introduced issue.
-
Try the March 5th commit: 5f69089
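For example:

# check out the March 5th commit
git fetch origin
git checkout 5f69089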
-
Quick update: I am still investigating what is wrong. The issue also seems to appear on the March 5th commit.
-
I haven't been able to resolve the issue. The first time I deploy, I get the usual timeout:
Upon invoking the deployer a second time, the deployment finishes almost instantly.
However, invoking the helloworld function does not seem to work.
Only the containerd logs contain error messages, which seem to be network related:
-
This is currently working properly.