Split Redis configuration for each connection pool by database #3195

Closed
achimnol opened this issue Dec 4, 2024 · 0 comments · Fixed by #3725

Synopsis

In some customer sites, we have experienced reliability issues with the single shared Redis instance.

To mitigate these issues, i.e., to prevent them from impacting the SLA and to reduce the chance of them happening, let's split the Redis instances in our "halfstack" so that we can apply different levels of HA configuration depending on the purpose.

Since we already split the connection pools for each Redis database by purpose as follows, it is straightforward to change the initialization routine of those connection pools to use different connection parameters.

Redis databases

```python
# src/ai/backend/common/defs.py
from typing import Final

REDIS_STAT_DB: Final = 0
REDIS_RLIM_DB: Final = 1
REDIS_LIVE_DB: Final = 2
REDIS_IMAGE_DB: Final = 3
REDIS_STREAM_DB: Final = 4
```
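
For illustration, here is a minimal sketch of the current shape, where every pool shares one base address and differs only by DB index (this assumes plain `redis.asyncio` clients, and the `create_pools` helper is hypothetical, not Backend.AI's actual helper layer):

```python
# Hypothetical sketch: one client per database, all pointing at the same
# Redis instance. The real helper layer is more elaborate; this only
# illustrates why swapping in per-DB connection parameters is easy.
import redis.asyncio as aioredis

from ai.backend.common.defs import (
    REDIS_IMAGE_DB,
    REDIS_LIVE_DB,
    REDIS_RLIM_DB,
    REDIS_STAT_DB,
    REDIS_STREAM_DB,
)


def create_pools(base_addr: str) -> dict[int, aioredis.Redis]:
    # Today: a single shared address for all purposes; the proposal below
    # would let each DB index resolve to its own address instead.
    return {
        db: aioredis.Redis.from_url(f"redis://{base_addr}", db=db)
        for db in (
            REDIS_STAT_DB,
            REDIS_RLIM_DB,
            REDIS_LIVE_DB,
            REDIS_IMAGE_DB,
            REDIS_STREAM_DB,
        )
    }
```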

I do not plan to pre-define how to group these Redis databases, but leave it to our solution architect team to determine the desired setup per site.

That said, here are some references to understand the background:

Data persistency requirements

  • Needed:
    • STAT (statistics): if lost, it impacts the usage accounting critical for billing, etc.
    • LIVE (agent liveness, idle checkers): if lost, it impacts the scheduler performance and may mis-terminate running sessions depending on the idle-checker config.

  • Not much needed:
    • RLIM (rate limit): volatile information, effective only for up to 15 minutes; no need to enforce it strictly on failover.
    • STREAM (event bus): events are transient, and often they are retried periodically or have another means of synchronization.
    • IMAGE (per-agent image availability map): if lost, it is automatically reconstructed from agent heartbeats.

Main load patterns

  • STAT, LIVE: proportional to the number of sessions
  • RLIM: proportional to the volume of client API requests
  • IMAGE: proportional to the number of cluster nodes
  • STREAM: proportional to the number of cluster nodes & the number of sessions

Current Implementation

The configuration of Redis connection parameters is stored in etcd and shared across all cluster nodes, including the Manager, Agent, Storage Proxy, and App Proxy.

Currently, the Redis configuration can specify only one Redis instance (either a single host:port address or a list of sentinel host:port addresses).

```python
# Schema definitions using trafaret (t) and Backend.AI's validator
# extensions (tx). redis_default_config and redis_helper_default_config
# are the built-in defaults defined alongside these schemas.
import trafaret as t

from ai.backend.common import validators as tx

redis_helper_config_iv = t.Dict({
    t.Key("socket_timeout", default=5.0): t.ToFloat,
    t.Key("socket_connect_timeout", default=2.0): t.ToFloat,
    t.Key("reconnect_poll_timeout", default=0.3): t.ToFloat,
}).allow_extra("*")

redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(  # if present, addr is ignored and service_name becomes mandatory.
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
}).allow_extra("*")
```

Proposed Addition

Let's extend the configuration format to additionally specify per-db instance addresses (either a single host:port address or a list of sentinel host:port addresses).

We could just allow configuring optional, additional mappings from DB index to address settings (either single or sentinels), with unspecified ones falling back to the base address settings. This way, we keep backward compatibility with existing setups.
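
As a concrete sketch (the `per_db` key name and its exact shape are illustrative assumptions, not a finalized design), the schema extension could look like this, reusing the existing validators:

```python
# Sketch only: extend redis_config_iv with an optional per-DB override
# mapping; DB indexes not listed fall back to the base address settings.
redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
    # NEW (illustrative name): per-DB address overrides keyed by DB index.
    t.Key("per_db", optional=True): t.Mapping(
        t.ToInt,  # e.g., 0 (STAT), 1 (RLIM), 2 (LIVE), 3 (IMAGE), 4 (STREAM)
        t.Dict({
            t.Key("addr", optional=True): t.Null | tx.HostPortPair,
            t.Key("sentinel", optional=True): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
            t.Key("service_name", optional=True): t.Null | t.String,
            t.Key("password", optional=True): t.Null | t.String,
        }).allow_extra("*"),
    ),
}).allow_extra("*")
```

Any DB index absent from the override mapping keeps using the base `addr`/`sentinel` settings, so existing single-instance setups remain valid as-is.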

An example split (see the illustrative config sketch after this list):

  • Instance 1 for STAT, LIVE: persistence with AOF for HA
  • Instance 2 for STREAM, IMAGE: no persistence with HA
  • Instance 3 for RLIM: no persistence with HA
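
Expressed against the sketch above (hostnames and the `per_db` key are made up for illustration), that split could be configured as:

```python
# Illustrative values only; the actual configuration lives in etcd.
example_redis_config = {
    # Instance 1 (base/fallback): STAT (0) and LIVE (2), AOF persistence for HA.
    "addr": {"host": "redis-persistent", "port": 6379},
    "per_db": {
        # Instance 2: STREAM (4) and IMAGE (3), HA without persistence.
        4: {"addr": {"host": "redis-transient", "port": 6379}},
        3: {"addr": {"host": "redis-transient", "port": 6379}},
        # Instance 3: RLIM (1), HA without persistence.
        1: {"addr": {"host": "redis-rlim", "port": 6379}},
    },
}
```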