Setup ray head fault tolerance with Redis #22

imreddy13 · 2023-08-30T02:13:07Z

Add fault tolerance to the Ray cluster by persisting the GCS state in Redis.

By default the Ray cluster has no fault tolerance: if the head node dies, the application is disrupted because all the worker nodes restart. Job logs stored in the Ray dashboard are also lost.

The following changes enable fault tolerance:

- Deploy a Redis service in the GKE cluster.
- Modify the KubeRay config with annotations that enable fault tolerance, and set the environment variable RAY_REDIS_ADDRESS to the address of the Redis service.
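
As a rough illustration of the second change, the sketch below builds a minimal RayCluster manifest with the fault tolerance annotation and the RAY_REDIS_ADDRESS variable on the head container. It is not the config from this PR: the cluster name, the Redis address `redis:6379`, and the image tag are placeholders, and the manifest omits worker groups and resource settings. It assumes the KubeRay annotation `ray.io/ft-enabled` and the `ray.io/v1alpha1` API version.

```python
import json

# Hypothetical in-cluster Redis Service address; replace with the deployed Service.
REDIS_ADDRESS = "redis:6379"


def build_ft_raycluster(name: str, redis_address: str, ray_image: str) -> dict:
    """Return a minimal RayCluster manifest with GCS fault tolerance enabled."""
    head_container = {
        "name": "ray-head",
        "image": ray_image,
        # Points the head node's GCS at the external Redis instance.
        "env": [{"name": "RAY_REDIS_ADDRESS", "value": redis_address}],
    }
    return {
        "apiVersion": "ray.io/v1alpha1",  # assumed API version
        "kind": "RayCluster",
        "metadata": {
            "name": name,
            # Annotation that asks KubeRay to enable GCS fault tolerance.
            "annotations": {"ray.io/ft-enabled": "true"},
        },
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": {"spec": {"containers": [head_container]}},
            },
        },
    }


# Placeholder name and image tag, for illustration only.
manifest = build_ft_raycluster("ray-ft-demo", REDIS_ADDRESS, "rayproject/ray:2.6.3")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be reviewed against the actual config to confirm that both the annotation and the environment variable are present on the head group.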

To test this, create a Ray cluster with fault tolerance enabled, force-delete the head node pod, and verify that the worker pods do not restart.
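
The test above could be scripted roughly as follows. This is a sketch, not part of the PR: the pod names and the snapshot format (worker pod name mapped to its UID and restart count) are hypothetical, and collecting the snapshots from the cluster (for example with `kubectl get pods -o json`) is left out. The check treats a worker as restarted if its pod was replaced (new UID) or its container restart count changed.

```python
from typing import Dict, List


def force_delete_cmd(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that force-deletes a pod immediately."""
    return ["kubectl", "delete", "pod", pod, "-n", namespace,
            "--grace-period=0", "--force"]


def workers_survived(before: Dict[str, dict], after: Dict[str, dict]) -> bool:
    """Return True if every worker pod seen before is unchanged afterwards.

    Each snapshot maps a worker pod name to {"uid": str, "restarts": int}.
    """
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            return False  # the pod disappeared
        if new["uid"] != old["uid"]:
            return False  # the pod was recreated
        if new["restarts"] != old["restarts"]:
            return False  # a container inside the pod restarted
    return True


# Hypothetical snapshots taken before and after deleting the head pod.
before = {"ray-ft-demo-worker-abc": {"uid": "u-1", "restarts": 0}}
after_ok = {"ray-ft-demo-worker-abc": {"uid": "u-1", "restarts": 0}}
after_restarted = {"ray-ft-demo-worker-abc": {"uid": "u-1", "restarts": 1}}

print(" ".join(force_delete_cmd("ray-ft-demo-head-xyz")))
print(workers_survived(before, after_ok))
print(workers_survived(before, after_restarted))
```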

Setup GCS fault tolerance with Redis for ray

6eaa4d7

imreddy13 requested review from richardsliu and spencer-p August 30, 2023 02:15
