Split Redis configuration for each connection pool by database #3195

Closed
achimnol opened this issue Dec 4, 2024 · 0 comments · Fixed by #3725

Synopsis

In some customer sites, we have experienced reliability issues with the single shared Redis instance.

To mitigate these issues, i.e., to prevent them from impacting the SLA and to reduce the chance of them happening, let's split the Redis instances in our "halfstack" so that we can apply different levels of HA configuration depending on the purpose.

Since we already split the connection pools for each Redis database by purpose as follows, it is straightforward to change the initialization routine of those connection pools to use different connection parameters.

Redis databases

```python
# src/ai/backend/common/defs.py
from typing import Final

REDIS_STAT_DB: Final = 0
REDIS_RLIM_DB: Final = 1
REDIS_LIVE_DB: Final = 2
REDIS_IMAGE_DB: Final = 3
REDIS_STREAM_DB: Final = 4
```
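
For illustration, here is a minimal sketch of the current shape, where every pool shares one base address and differs only by DB index (this assumes plain `redis.asyncio` clients, and the `create_pools` helper is hypothetical, not Backend.AI's actual helper layer):

```python
# Hypothetical sketch: one client per database, all pointing at the same
# Redis instance. The real helper layer is more elaborate; this only
# illustrates why swapping in per-DB connection parameters is easy.
import redis.asyncio as aioredis

from ai.backend.common.defs import (
    REDIS_IMAGE_DB,
    REDIS_LIVE_DB,
    REDIS_RLIM_DB,
    REDIS_STAT_DB,
    REDIS_STREAM_DB,
)


def create_pools(base_addr: str) -> dict[int, aioredis.Redis]:
    # Today: a single shared address for all purposes; the proposal below
    # would let each DB index resolve to its own address instead.
    return {
        db: aioredis.Redis.from_url(f"redis://{base_addr}", db=db)
        for db in (
            REDIS_STAT_DB,
            REDIS_RLIM_DB,
            REDIS_LIVE_DB,
            REDIS_IMAGE_DB,
            REDIS_STREAM_DB,
        )
    }
```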

I do not plan to pre-define how to group these Redis databases, but leave it to our solution architect team to determine the desired setup per site.

That said, here are some references to understand the background:

Data persistency requirements

  • Needed:
    • STAT (statistics): if lost, it impacts the usage accounting critical for billing, etc.
    • LIVE (agent liveness, idle checkers): if lost, it impacts the scheduler performance and may mis-terminate running sessions depending on the idle-checker config.

  • Not much needed:
    • RLIM (rate limit): volatile information, effective only for up to 15 minutes; no need to enforce it strictly on failover.
    • STREAM (event bus): events are transient, and often they are retried periodically or have another means of synchronization.
    • IMAGE (per-agent image availability map): if lost, it is automatically reconstructed from agent heartbeats.

Main load patterns

  • STAT, LIVE: proportional to the number of sessions
  • RLIM: proportional to the volume of client API requests
  • IMAGE: proportional to the number of cluster nodes
  • STREAM: proportional to the number of cluster nodes & the number of sessions

Current Implementation

The configuration of Redis connection parameters is stored in etcd and shared across all cluster nodes, including the Manager, Agent, Storage Proxy, and App Proxy.

Currently, the Redis configuration can specify only one Redis instance (either a single host:port address or a list of sentinel host:port addresses).

```python
# Schema definitions using trafaret (t) and Backend.AI's validator
# extensions (tx). redis_default_config and redis_helper_default_config
# are the built-in defaults defined alongside these schemas.
import trafaret as t

from ai.backend.common import validators as tx

redis_helper_config_iv = t.Dict({
    t.Key("socket_timeout", default=5.0): t.ToFloat,
    t.Key("socket_connect_timeout", default=2.0): t.ToFloat,
    t.Key("reconnect_poll_timeout", default=0.3): t.ToFloat,
}).allow_extra("*")

redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(  # if present, addr is ignored and service_name becomes mandatory.
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
}).allow_extra("*")
```

Proposed Addition

Let's extend the configuration format to additionally specify per-db instance addresses (either a single host:port address or a list of sentinel host:port addresses).

We could just allow configuring optional, additional mappings from DB index to address settings (either single or sentinels), with unspecified ones falling back to the base address settings. This way, we keep backward compatibility with existing setups.
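
As a concrete sketch (the `per_db` key name and its exact shape are illustrative assumptions, not a finalized design), the schema extension could look like this, reusing the existing validators:

```python
# Sketch only: extend redis_config_iv with an optional per-DB override
# mapping; DB indexes not listed fall back to the base address settings.
redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
    # NEW (illustrative name): per-DB address overrides keyed by DB index.
    t.Key("per_db", optional=True): t.Mapping(
        t.ToInt,  # e.g., 0 (STAT), 1 (RLIM), 2 (LIVE), 3 (IMAGE), 4 (STREAM)
        t.Dict({
            t.Key("addr", optional=True): t.Null | tx.HostPortPair,
            t.Key("sentinel", optional=True): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
            t.Key("service_name", optional=True): t.Null | t.String,
            t.Key("password", optional=True): t.Null | t.String,
        }).allow_extra("*"),
    ),
}).allow_extra("*")
```

Any DB index absent from the override mapping keeps using the base `addr`/`sentinel` settings, so existing single-instance setups remain valid as-is.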

An example split (see the illustrative config sketch after this list):

  • Instance 1 for STAT, LIVE: persistence with AOF for HA
  • Instance 2 for STREAM, IMAGE: no persistence with HA
  • Instance 3 for RLIM: no persistence with HA
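
Expressed against the sketch above (hostnames and the `per_db` key are made up for illustration), that split could be configured as:

```python
# Illustrative values only; the actual configuration lives in etcd.
example_redis_config = {
    # Instance 1 (base/fallback): STAT (0) and LIVE (2), AOF persistence for HA.
    "addr": {"host": "redis-persistent", "port": 6379},
    "per_db": {
        # Instance 2: STREAM (4) and IMAGE (3), HA without persistence.
        4: {"addr": {"host": "redis-transient", "port": 6379}},
        3: {"addr": {"host": "redis-transient", "port": 6379}},
        # Instance 3: RLIM (1), HA without persistence.
        1: {"addr": {"host": "redis-rlim", "port": 6379}},
    },
}
```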