From 3e7d80ca2d1ce9e63e65ade4bbbf6449771b7798 Mon Sep 17 00:00:00 2001
From: Benjamin Ingberg
Date: Mon, 30 Dec 2024 13:21:35 +0100
Subject: [PATCH] Add a new ADR for resharding improvements

---
 0011-resharding-without-downtime.md | 258 ++++++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 0011-resharding-without-downtime.md

diff --git a/0011-resharding-without-downtime.md b/0011-resharding-without-downtime.md
new file mode 100644
index 0000000..6bf1167
--- /dev/null
+++ b/0011-resharding-without-downtime.md
@@ -0,0 +1,258 @@
+# Buildbarn Architecture Decision Record #11: Resharding Without Downtime
+
+Author: Benjamin Ingberg
+Date: 2024-12-30
+
+# Context
+
+Resharding a Buildbarn cluster without downtime is today a multi-step process
+consisting of the following steps:
+
+1. Deploy new storage shards.
+2. Deploy a new topology with a read fallback configuration. Configure the old
+   topology as the primary and the new topology as the secondary.
+3. When the topology from step 2 has propagated, swap the primary and secondary.
+4. When your new shards have performed a sufficient amount of replication,
+   deploy a topology without fallback configuration.
+5. Once the topology from step 4 has propagated, tear down any unused shards.
+
+This process enables live resharding of a cluster. It works because each step
+is backwards compatible with the previous step. That is, accessing the
+blobstore with the topology from step N-1 will resolve correctly even if some
+components are already using the topology from step N.
+
+The exact timing and method needed to perform these steps depend on how you
+orchestrate your Buildbarn cluster and on your retention goals. This process
+might span anywhere from an hour to several weeks.
+
+Resharding a cluster is a rare operation, so having multiple steps to achieve
+it is not inherently problematic. However, without a significant amount of
+automation of the cluster's meta-state, there is a large risk of performing it
+incorrectly.
+
+# Issues During Resharding
+
+## Non-Availability of a Secondary Set of Shards
+
+You might not have the ability to spin up a secondary set of storage shards to
+perform the switchover. This is a common situation in an on-prem environment,
+where running two copies of your production environment may not be feasible.
+
+This is not necessarily a blocker. You can reuse shards from the old topology
+in your new topology. However, this risks significantly reducing your
+retention time, since data must be stored according to the addressing schemas
+of both the new and the old topology simultaneously.
+
+While it is possible to reduce the amount of address space that is resharded
+by using drained backends, this requires advance planning.
+
+## Topology Changes Require Restarts
+
+Currently, the only way to modify the topology visible to an individual
+Buildbarn component is to restart that component. While mounting a Kubernetes
+ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
+logic for dynamically reloading their blob access configuration.
+
+A properly configured cluster can still perform rolling updates. However, there
+is a trade-off between the speed of the rollout and the potential loss of
+ongoing work. Since clients automatically retry when they encounter errors,
+losing ongoing work may be the preferable problem to accept. Nonetheless, for
+clusters with very expensive long-running actions, this could result in a
+significant loss of work.
+
+The number of components that need to be restarted can be reduced by routing
+traffic via internal-facing frontends that are topology aware.
+
+# Improvements
+
+## Better Overlap Between Sharding Topologies
+
+Currently, two different sharding topologies, even if they share nodes, will
+have only a small overlap between their addressing schemas. This can be
+significantly improved by using a different sharding algorithm.
+
+For this purpose we replace the implementation of
+`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
+hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
+is a lightweight and stateless technique for distributed hash tables. It has
+low overhead and causes minimal disruption during resharding.
+
+Sharding with Rendezvous hashing gives us the following properties:
+ * Removing a shard is _guaranteed_ to only require resharding for the blobs
+   that resolved to the removed shard.
+ * Adding a shard will reshard any blob to the new shard with a probability of
+   `weight/total_weight`.
+
+This effectively means adding or removing a shard triggers a predictable,
+minimal amount of resharding, eliminating the need for drained backends.
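+
+To make the selection rule concrete, below is a minimal Go sketch of weighted
+Rendezvous hashing. The type and function names are illustrative and not the
+actual Buildbarn implementation; FNV-1a merely stands in for whatever hash
+function the real implementation uses, and `hash_initialization` is mixed in
+with a simple XOR for brevity:
+
+```go
+package sharding
+
+import (
+	"hash/fnv"
+	"math"
+)
+
+// Shard is an illustrative stand-in for an entry in 'shard_map'.
+type Shard struct {
+	Key    string
+	Weight uint32
+}
+
+// PickShard returns the key of the shard that a blob digest resolves to.
+// Every shard is scored independently and the highest score wins, which is
+// what guarantees that removing a shard only moves the blobs that resolved
+// to it, while adding a shard steals blobs with probability
+// weight/total_weight.
+func PickShard(shards []Shard, digest string, hashInitialization uint64) string {
+	bestKey, bestScore := "", math.Inf(-1)
+	for _, s := range shards {
+		// Hash the digest/shard pair.
+		h := fnv.New64a()
+		h.Write([]byte(digest))
+		h.Write([]byte(s.Key))
+		sum := h.Sum64() ^ hashInitialization
+
+		// Map the hash to the open interval (0, 1) and score the shard
+		// as -weight / log(h). Because log(h) is negative, the score is
+		// positive and grows with the shard's weight.
+		u := (float64(sum>>11) + 0.5) / float64(1<<53)
+		score := -float64(s.Weight) / math.Log(u)
+		if score > bestScore {
+			bestKey, bestScore = s.Key, score
+		}
+	}
+	return bestKey
+}
+```
+
+The configuration schema for `ShardingBlobAccessConfiguration` then becomes: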
+
+```
+message ShardingBlobAccessConfiguration {
+  message Shard {
+    // unchanged
+    BlobAccessConfiguration backend = 1;
+    // unchanged
+    uint32 weight = 2;
+  }
+  // unchanged
+  uint64 hash_initialization = 1;
+
+  // Was 'shards', an array of shards to use. It has been replaced with
+  // 'shard_map'.
+  reserved 2;
+
+  // NEW:
+  // Shards identified by a key. The key is a freeform string which
+  // describes the identity of the shard in the context of the current
+  // sharding configuration. Shards are chosen via Rendezvous hashing based
+  // on the digest, weight, key and hash_initialization of the
+  // configuration.
+  //
+  // When removing a shard from the map it is guaranteed that only blobs
+  // which resolved to the removed shard will get a different shard. When
+  // adding shards there is a weight/total_weight probability that any given
+  // blob will be resolved to the new shards.
+  map<string, Shard> shard_map = 3;
+}
+```
+
+### Other Algorithms
+
+Other algorithms considered were [Consistent
+hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
+[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).
+
+Rendezvous hashing was chosen for its simplicity of implementation, its
+mathematical elegance, and the fact that it does not need to be parameterized
+in configuration.
+
+_Of all shards, pick the one with the largest `-weight / log(h)`, where `h` is
+a hash of the key/node pair mapped to the open interval `]0, 1[`._
+
+There are some minor advantages and disadvantages to the other algorithms, in
+particular with regard to how computationally heavy they are. However, in a
+typical Buildbarn cluster with dozens of shards, the amount of computation
+required to pick a shard is unlikely to be relevant.
+
+#### Consistent Hashing
+
+Consistent hashing can be reduced to a special case of Rendezvous hashing,
+with the advantage that it can be precomputed into a sorted list in which we
+can binary search for the target in `log(W)` time. This is in contrast to
+Rendezvous hashing, which takes `N` time to search.
+
+* `N` is the number of shards
+* `W` is the sum of weights in its reduced form (i.e., the smallest weight is
+  1 and all weights are integers)
+
+In practice, Buildbarn clusters only have a small number of shards, and a
+straight linear search with Rendezvous hashing is faster for any reasonably
+sized cluster (say, fewer than 100 shards). Rendezvous hashing is also
+independent of the weights of the individual shards, which improves its
+relative performance when shards are not evenly weighted.
+
+#### Maglev
+
+Maglev in turn uses an iterative method that precomputes an evenly distributed
+lookup table of any arbitrary size. After the precomputation step, the target
+shard is found by a simple lookup of the key modulo the size of the table.
+
+The disadvantage of Maglev is that it aims for minimal resharding rather than
+optimal resharding. When removing a shard from the set of all shards, the
+algorithm tolerates a small amount of unnecessary resharding. This is on the
+order of a few percent, depending on the number of shards and the size of the
+lookup table.
+
+## Handling Topology Changes Dynamically
+
+To reduce the number of restarts required, Buildbarn needs a construct that
+describes how to reload the topology.
+
+### RemotelyDefinedBlobAccessConfiguration
+
+```
+message RemotelyDefinedBlobAccessConfiguration {
+  // Fetch the blob access configuration from an external service.
+  buildbarn.configuration.grpc.ClientConfiguration endpoint = 1;
+
+  // Duration for which to keep reusing the previous configuration when
+  // RemoteBlobAccessConfiguration.GetBlobAccessConfiguration cannot be
+  // reached, before the component should consider itself partitioned from
+  // the cluster and return UNAVAILABLE for any access requests.
+  //
+  // Recommended value: 10s
+  google.protobuf.Duration remote_configuration_timeout = 2;
+}
+```
+
+This configuration calls a service that implements the following:
+
+```
+service RemoteBlobAccessConfiguration {
+  // A request to subscribe to the blob access configuration.
+  // The service should immediately return the current blob access
+  // configuration. After the initial blob access configuration the stream
+  // will push out any changes to the blob access configuration.
+  rpc GetBlobAccessConfiguration(GetBlobAccessConfigurationRequest) returns (stream BlobAccessConfiguration);
+}
+
+message GetBlobAccessConfigurationRequest {
+  enum StorageBackend {
+    CAS = 0;
+    AC = 1;
+    ICAS = 2;
+    ISCC = 3;
+    FSAC = 4;
+  }
+  // Which storage backend the service should describe the topology for.
+  StorageBackend storage_backend = 1;
+}
+```
+
+A simple implementation of this service could be a sidecar container that
+dynamically reads a ConfigMap.
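+
+For illustration, below is a minimal Go sketch of such a sidecar. It assumes
+generated stubs for the service above (the `pb` import path is hypothetical),
+assumes the ConfigMap stores one protojson-encoded file per storage backend
+under `/config`, and polls the mounted file rather than watching the
+Kubernetes API. A real implementation would evaluate jsonnet and reuse the
+existing `BlobAccessConfiguration` message from bb-storage's configuration
+protos:
+
+```go
+package main
+
+import (
+	"log"
+	"net"
+	"os"
+	"path/filepath"
+	"time"
+
+	"google.golang.org/grpc"
+	"google.golang.org/protobuf/encoding/protojson"
+
+	// Hypothetical package containing the stubs generated from the
+	// service definition above.
+	pb "example.com/buildbarn/remoteconfig/proto"
+)
+
+// configServer streams the contents of a mounted ConfigMap to every
+// Buildbarn component that subscribes to it.
+type configServer struct {
+	pb.UnimplementedRemoteBlobAccessConfigurationServer
+	dir string
+}
+
+func (s *configServer) GetBlobAccessConfiguration(
+	req *pb.GetBlobAccessConfigurationRequest,
+	stream pb.RemoteBlobAccessConfiguration_GetBlobAccessConfigurationServer,
+) error {
+	// One file per storage backend, e.g. /config/CAS.json.
+	path := filepath.Join(s.dir, req.GetStorageBackend().String()+".json")
+	var lastSent []byte
+	for {
+		// Kubernetes updates the mounted file in place when the
+		// ConfigMap object changes.
+		data, err := os.ReadFile(path)
+		if err != nil {
+			return err
+		}
+		if string(data) != string(lastSent) {
+			var config pb.BlobAccessConfiguration
+			if err := protojson.Unmarshal(data, &config); err != nil {
+				return err
+			}
+			if err := stream.Send(&config); err != nil {
+				return err
+			}
+			lastSent = data
+		}
+		select {
+		case <-stream.Context().Done():
+			return stream.Context().Err()
+		case <-time.After(5 * time.Second):
+		}
+	}
+}
+
+func main() {
+	listener, err := net.Listen("tcp", ":8980")
+	if err != nil {
+		log.Fatal(err)
+	}
+	server := grpc.NewServer()
+	pb.RegisterRemoteBlobAccessConfigurationServer(server, &configServer{dir: "/config"})
+	log.Fatal(server.Serve(listener))
+}
+```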
+
+A more complex implementation might:
+ * Read Prometheus metrics and roll out updates to the correct mirror.
+ * Increase the number of shards when the metrics indicate that the retention
+   time has fallen below a desired value.
+ * Stagger read fallback configurations which are removed automatically after
+   a sufficient amount of time has passed.
+
+### JsonnetDefinedBlobAccessConfiguration
+
+Here we refer to a secondary configuration file that is dynamically loaded at
+runtime. If this file is modified, Buildbarn will reload it.
+
+```
+message JsonnetDefinedBlobAccessConfiguration {
+  // Path to the jsonnet file which describes the blob access configuration
+  // to use. This file will be monitored for changes.
+  string path = 1;
+
+  // External variables to use when parsing the jsonnet file. These are used
+  // in addition to the process environment variables, which are always used.
+  map<string, string> external_variables = 2;
+
+  // Buildbarn will use the inotify API to watch for any changes to the
+  // underlying jsonnet file, any libsonnet file it includes and any optional
+  // lock file specified. Should the inotify API not be able to notice file
+  // modifications, Buildbarn can be configured to periodically poll for
+  // changes to the configuration files.
+  //
+  // This typically happens because the underlying file system does not
+  // support the inotify API and/or because the file is hidden by bind
+  // mounts.
+  //
+  // Setting this value to 0 will disable the periodic refresh interval.
+  google.protobuf.Duration refresh_interval = 3;
+
+  // An optional path to a lockfile whose presence Buildbarn will check
+  // before parsing the jsonnet file. Buildbarn will only parse the jsonnet
+  // file if the lockfile does not exist.
+  //
+  // The lockfile is there to prevent unintentional partial updates to the
+  // BlobAccessConfiguration, such as when a jsonnet file imports a libsonnet
+  // file and each file gets updated separately.
+  string lock_file_path = 4;
+}
+```
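+
+To make the reload behavior concrete, below is a minimal Go sketch of the
+watching logic. It uses the third-party `github.com/fsnotify/fsnotify`
+package as a stand-in for the inotify API and combines it with the polling
+fallback and the lockfile check described above. The function signature is
+illustrative; a real implementation would also track libsonnet files that are
+imported transitively:
+
+```go
+package dynamicconfig
+
+import (
+	"log"
+	"os"
+	"time"
+
+	"github.com/fsnotify/fsnotify"
+)
+
+// WatchJsonnetConfiguration calls reload whenever one of the watched files
+// changes, unless the lockfile is present. A refreshInterval of 0 disables
+// the periodic polling fallback.
+func WatchJsonnetConfiguration(paths []string, lockFilePath string, refreshInterval time.Duration, reload func() error) error {
+	watcher, err := fsnotify.NewWatcher()
+	if err != nil {
+		return err
+	}
+	defer watcher.Close()
+	for _, path := range paths {
+		if err := watcher.Add(path); err != nil {
+			return err
+		}
+	}
+
+	// The polling fallback covers file systems where inotify does not
+	// report modifications, e.g. files hidden behind bind mounts.
+	var poll <-chan time.Time
+	if refreshInterval > 0 {
+		ticker := time.NewTicker(refreshInterval)
+		defer ticker.Stop()
+		poll = ticker.C
+	}
+
+	for {
+		select {
+		case event := <-watcher.Events:
+			if event.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Remove) == 0 {
+				continue
+			}
+		case err := <-watcher.Errors:
+			log.Print("watch error: ", err)
+			continue
+		case <-poll:
+		}
+
+		// Only parse the jsonnet file while no lockfile is present, so
+		// that multi-file updates can be applied atomically.
+		if lockFilePath != "" {
+			if _, err := os.Stat(lockFilePath); err == nil {
+				continue
+			}
+		}
+		if err := reload(); err != nil {
+			log.Print("failed to reload blob access configuration: ", err)
+		}
+	}
+}
+```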