[Core] GCS FT with redis sentinel #47335
base: master
Conversation
Signed-off-by: Kan Wang <[email protected]>
hey @rkooo567 gentle ping on this PR.
Hey @kanwang, per Ray Slack; aiming to get to this early next week.
hey @anyscalesam
Hi @kanwang, sorry, the team is a bit busy right now. We will review it ASAP.
Hey team, any chance that we can get this reviewed this week or early next week? Thank you! cc @jjyao
Very sorry for the late review. I'll actively review this PR from now on.
src/ray/gcs/redis_context.cc
Outdated
@@ -506,6 +564,14 @@ Status RedisContext::Connect(const std::string &address,
  // Ray has some restrictions for RedisDB. Validate it here.
  ValidateRedisDB(*this);
If it's Redis Sentinel, what will INFO CLUSTER return?
It will return empty; that section simply doesn't exist on a Sentinel. Here's the full result of INFO
on a Redis Sentinel:
# Server
redis_version:7.0.7
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:5dad631ce1b0fc10
redis_mode:sentinel
os:Linux 6.8.0-1017-aws x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:9.4.0
process_id:7
process_supervised:no
run_id:de8c925350decfbb0abd5940bf718855eb191f8d
tcp_port:26379
server_time_usec:1730346466295003
uptime_in_seconds:728236
uptime_in_days:8
hz:14
configured_hz:10
lru_clock:2293218
executable:/redis-sentinel
config_file:/etc/redis/sentinel.conf
io_threads_active:0
# Clients
connected_clients:3
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20480
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Stats
total_connections_received:121374
total_commands_processed:2334908
instantaneous_ops_per_sec:3
total_net_input_bytes:135052866
total_net_output_bytes:163753996
total_net_repl_input_bytes:0
total_net_repl_output_bytes:0
instantaneous_input_kbps:0.17
instantaneous_output_kbps:0.02
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:12828
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:97084
dump_payload_sanitizations:0
total_reads_processed:2502068
total_writes_processed:2380696
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:29
reply_buffer_expands:0
# CPU
used_cpu_sys:1282.061897
used_cpu_user:931.763629
used_cpu_sys_children:0.032320
used_cpu_user_children:0.026019
used_cpu_sys_main_thread:1282.061999
used_cpu_user_main_thread:931.762826
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_tilt_since_seconds:-1
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=redis-ha,status=ok,address=10.112.187.90:6379,slaves=2,sentinels=3
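As an aside, the `redis_mode` field in the output above already distinguishes a Sentinel from a regular server. Below is a minimal sketch of classifying an INFO reply by that field; the helper name `InfoIndicatesSentinel` is made up for illustration and is not part of this PR (the PR itself checks whether the `INFO SENTINEL` section is empty instead).

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not in the PR): decide whether a raw INFO reply came
// from a Sentinel by looking at its "redis_mode:" line.
bool InfoIndicatesSentinel(const std::string &info_reply) {
  std::istringstream lines(info_reply);
  std::string line;
  const std::string key = "redis_mode:";
  while (std::getline(lines, line)) {
    // INFO lines end with \r\n; trim the trailing \r before comparing.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (line.rfind(key, 0) == 0) {
      return line.substr(key.size()) == "sentinel";
    }
  }
  // No redis_mode field at all: treat as a regular server.
  return false;
}
```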
src/ray/gcs/redis_context.cc
Outdated
  // directly. continue otherwise.
  return sentinel_status;
}

// Find the true leader
This is needed only for Redis Cluster.
I feel our high-level code for RedisContext::Connect should be:
if (redis_cluster) {
ValidateRedisCluster() // make sure it only has 1 shard
FindMasterUsingMOVED()
} else {
// redis sentinel
FindMasterUsingSentinel()
}
thanks for the suggestion! Refactored a little bit, so the logic in RedisContext::Connect
is now mostly untouched: I added one validation for redis_sentinel and an else branch to handle Sentinel.
this should be ready for re-review.
Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: kanwang <[email protected]>
Signed-off-by: Kan Wang <[email protected]>
hey @jjyao gentle ping on this. thank you!
hey @jjyao checking on this again since all tests passed now.
Hi @kanwang I'll take another review this week.
src/ray/gcs/redis_context.cc
Outdated
@@ -431,6 +431,69 @@ void ValidateRedisDB(RedisContext &context) {
  }
}

Status ValidateRedisSentinel(RedisContext &context) {
bool IsRedisSentinel(RedisContext &context) {
}
src/ray/gcs/redis_context.cc
Outdated
// if type error, this is a redis cluster. continue to validate and connect

// Ray has some restrictions for RedisDB. Validate it here.
ValidateRedisDB(*this);

// Find the true leader
std::vector<const char *> argv;
std::vector<size_t> argc;
std::vector<std::string> cmds = {"DEL", "DUMMY"};
for (const auto &arg : cmds) {
  argv.push_back(arg.data());
  argc.push_back(arg.size());
}

auto redis_reply = reinterpret_cast<redisReply *>(
    ::redisCommandArgv(context_.get(), cmds.size(), argv.data(), argc.data()));

if (redis_reply->type == REDIS_REPLY_ERROR) {
  // This should be a MOVED error
  // MOVED 14946 10.xx.xx.xx:7001
  std::string error_msg(redis_reply->str, redis_reply->len);
  freeReplyObject(redis_reply);
  auto maybe_ip_port = ParseIffMovedError(error_msg);
  RAY_CHECK(maybe_ip_port.has_value())
      << "Setup Redis cluster failed in the dummy deletion: " << error_msg;
  Disconnect();
  const auto &[ip, port] = maybe_ip_port.value();
  // Connect to the true leader.
  RAY_LOG(INFO) << "Redis cluster leader is " << ip << ":" << port
                << ". Reconnect to it.";
  return Connect(ip, port, password, enable_ssl);
} else {
  RAY_LOG(INFO) << "Redis cluster leader is " << ip_addresses[0] << ":" << port;
  freeReplyObject(redis_reply);
}
return Status::OK();
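For reference, here is a standalone sketch of the MOVED-error parsing that ParseIffMovedError performs in the snippet above. The helper name `ParseMovedAddress` and its exact behavior are assumptions for illustration, not the PR's actual implementation.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sketch: parse a Redis Cluster redirection error of the form
//   "MOVED 14946 10.112.187.90:7001"
// into the (ip, port) of the node that actually owns the slot.
std::optional<std::pair<std::string, int>> ParseMovedAddress(
    const std::string &error_msg) {
  std::istringstream iss(error_msg);
  std::string keyword, slot, addr;
  if (!(iss >> keyword >> slot >> addr) || keyword != "MOVED") {
    return std::nullopt;  // not a MOVED error
  }
  // The address is "ip:port"; split on the last colon (IPv6-friendly).
  auto colon = addr.rfind(':');
  if (colon == std::string::npos) {
    return std::nullopt;
  }
  return std::make_pair(addr.substr(0, colon),
                        std::stoi(addr.substr(colon + 1)));
}
```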
Let's move this into a private function as well so we can have:
if (IsRedisSentinel(*this)) {
return ConnectRedisSentinel();
} else {
return ConnectRedisCluster();
}
src/ray/gcs/redis_context.cc
Outdated
RAY_CHECK(redis_reply && redis_reply->type == REDIS_REPLY_ARRAY)
    << "failed to get redis sentinel masters info";
RAY_CHECK(redis_reply->elements == 1)
RAY_CHECK_EQ
src/ray/gcs/redis_context.cc
Outdated
auto redis_reply = reinterpret_cast<redisReply *>(
    ::redisCommandArgv(context.sync_context(), cmds.size(), argv.data(), argc.data()));

RAY_CHECK(redis_reply && redis_reply->type == REDIS_REPLY_ARRAY)
Can we split this into two RAY_CHECKs so that if one fails we know which one failed?
src/ray/gcs/redis_context.cc
Outdated
RAY_LOG(FATAL)
    << "failed to get the ip and port of the primary node from redis sentinel";
FATAL will exit the process; I think you want ERROR.
Signed-off-by: Kan Wang <[email protected]>
Thanks! Addressed the feedback.
Signed-off-by: Kan Wang <[email protected]>
LG. I'll review the tests later.
Have you tested with real Redis Sentinel?
bool isRedisSentinel(RedisContext &context) {
  auto reply = context.RunArgvSync(std::vector<std::string>{"INFO", "SENTINEL"});
  if (reply->IsNil() || reply->IsError() || reply->ReadAsString().length() == 0) {
    RAY_LOG(INFO) << "failed to get redis sentinel info, continue as a regular redis.";
I'd remove this log and add logs inside ConnectRedisCluster and ConnectRedisSentinel like
RAY_LOG(INFO) << "Connect to Redis cluster/sentinel";
    ::redisCommandArgv(context.sync_context(), cmds.size(), argv.data(), argc.data()));

RAY_CHECK(redis_reply) << "failed to get redis sentinel masters info";
RAY_CHECK(redis_reply->type == REDIS_REPLY_ARRAY)
RAY_CHECK_EQ
    << "redis sentinel master info should be REDIS_REPLY_ARRAY but got "
    << redis_reply->type;
RAY_CHECK_EQ(redis_reply->elements, 1)
    << "expecting only one primary behind the redis sentinel";
Suggested change:
- << "expecting only one primary behind the redis sentinel";
+ << "There should be only one primary behind the Redis sentinel";
RAY_LOG(INFO) << "connecting to the redis primary node behind sentinel: " << actual_ip
              << ":" << actual_port;
context.Disconnect();
return context.Connect(actual_ip, std::stoi(actual_port), password, enable_ssl);
When we connect to the primary, IsRedisSentinel will return false now?
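For anyone following along: SENTINEL MASTERS replies with, per monitored master, a flat array of alternating field-name/value bulk strings (name, ip, port, ...), which is where `actual_ip` and `actual_port` above come from. A sketch of extracting the primary's address from such a reply, modeling the bulk strings as `std::string` rather than raw `redisReply` elements; the helper name `ExtractPrimaryAddress` is made up for illustration:

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: walk the alternating field/value pairs of one
// SENTINEL MASTERS entry and pull out the "ip" and "port" values.
std::optional<std::pair<std::string, std::string>> ExtractPrimaryAddress(
    const std::vector<std::string> &master_fields) {
  std::optional<std::string> ip, port;
  for (size_t i = 0; i + 1 < master_fields.size(); i += 2) {
    if (master_fields[i] == "ip") ip = master_fields[i + 1];
    if (master_fields[i] == "port") port = master_fields[i + 1];
  }
  if (ip && port) {
    return std::make_pair(*ip, *port);
  }
  return std::nullopt;  // malformed or truncated reply
}
```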
Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: kanwang <[email protected]>
Why are these changes needed?
We want to use Redis Sentinel to support Ray GCS FT. I opened a ticket here: #46983. Redis Sentinel should provide high availability without much of the operational overhead of running a Redis Cluster.
Related issue number
Closes #46983
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.