-
Notifications
You must be signed in to change notification settings - Fork 6
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add a new ADR for resharding improvements
- Loading branch information
1 parent
c1016eb
commit 775c8da
Showing
1 changed file
with
218 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
# Buildbarn Architecture Decision Record #11: Resharding Without Downtime | ||
|
||
Author: Benjamin Ingberg<br/> | ||
Date: 2024-12-30 | ||
|
||
# Context | ||
|
||
Resharding a Buildbarn cluster without downtime is today a multi step process | ||
that can be done as the following steps: | ||
|
||
1. Deploy new storage shards. | ||
2. Deploy a new topology with a read fallback configuration. Configure the old | ||
topology as the primary and the new topology as the secondary. | ||
3. When the topology from step 2 has propagated, swap the primary and secondary. | ||
4. When your new shards have performed sufficient amount of replication, deploy | ||
a topology without fallback configuration. | ||
5. Once the topology from step 4 has propagated, tear down any unused shards | ||
|
||
This process enables live resharding of a cluster. It works because each step is | ||
backwards compatible with the previous step. That is, accessing the blobstore | ||
with the topology from step N-1 will resolve correctly even if some components | ||
are already using the topology from step N. | ||
|
||
The exact timing and method needed to perform these steps depend on how you | ||
orchestrate your buildbarn cluster and your retention aims. This process might | ||
span an hour to several weeks. | ||
|
||
Resharding a cluster is a rare operation, so having multiple steps to achieve it | ||
is not inherently problematic. However, without a significant amount of | ||
automation of the cluster's meta-state there are large risks for performing it | ||
incorrectly. | ||
|
||
# Issues During Resharding | ||
|
||
## Non-Availability of a Secondary Set of Shards | ||
|
||
You might not have the ability to spin up a secondary set of storage shards to | ||
perform the switchover. This is a common situation in an on-prem environment, | ||
where running two copies of your production environment may not be feasible. | ||
|
||
This is not necessarily a blocker. You can reuse shards from the old topology | ||
in your new topology. However, this has a risk of significantly reducing your | ||
retention time since data must be stored according to the addressing schema of | ||
both the new and the old topology simultaneously. | ||
|
||
While it is possible to reduce the amount of address space that is resharded with | ||
drained backends this requires advance planning. | ||
|
||
## Topology changes requires restarts | ||
|
||
Currently, the only way to modify the topology visible to an individual | ||
Buildbarn component is to restart that component. While mounting a Kubernetes | ||
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack | ||
logic for dynamically reloading their blob access configuration. | ||
|
||
A properly configured cluster can still perform rolling updates. However, there | ||
is a trade-off between the speed of rollout and the potential loss of ongoing | ||
work. Since clients automatically retry when encountering errors, losing ongoing | ||
work may be the preferred issue to address. Nonetheless, for clusters with very | ||
expensive long-running actions, this could result in significant work loss. | ||
|
||
The amount of restarted components can be reduced by routing traffic via | ||
internal facing frontends which are topology aware. | ||
|
||
# Improvements | ||
|
||
## Better Overlap Between Sharding Topologies | ||
|
||
Currently, two different sharding topologies, even if they share nodes, will | ||
have a small overlap between addressing schemas. This can be significantly | ||
improved by using a different sharding algorithm. | ||
|
||
For this purpose we replace the implementation of | ||
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous | ||
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing | ||
is a lightweight and stateless technique for distributed hash tables. It has a low | ||
overhead with minimal disruption during resharding. | ||
|
||
Sharding with Rendezvous hashing gives us the following properties: | ||
* Removing a shard is _guaranteed_ to only require resharding for the blobs | ||
that resolved to the removed shard. | ||
* Adding a shard will reshard any blob to the new shard with a probability of | ||
`weight/total_weight`. | ||
|
||
This effectively means adding or removing a shard triggers a predictable, | ||
minimal amount of resharding, eliminating the need for drained backends. | ||
|
||
``` | ||
message ShardingBlobAccessConfiguration { | ||
message Shard { | ||
// unchanged | ||
BlobAccessConfiguration backend = 1; | ||
// unchanged | ||
uint32 weight = 2; | ||
} | ||
// unchanged | ||
uint64 hash_initialization = 1; | ||
// Was 'shards' an array of shards to use, has been replaced with | ||
// 'shard_map' | ||
reserved 2; | ||
// NEW: | ||
// Shards identified by a key within the context of this sharding | ||
// configuration. The key is a freeform string which describes the identity | ||
// of the shard in the context of the current sharding configuration. | ||
// Shards are chosen via Rendezvous hashing based on the digest, weight, | ||
// key and hash_initialization of the configuration. | ||
// | ||
// When removing a shard from the map it is guaranteed that only blobs | ||
// which resolved to the removed shard will get a different shard. When | ||
// adding shards there is a weight/total_weight probability that any given | ||
// blob will be resolved to the new shards. | ||
map<string, Shard> shard_map = 3; | ||
} | ||
``` | ||
|
||
Other algorithms considered were: [Consistent | ||
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and | ||
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf). | ||
|
||
## Handling Topology Changes Dynamically | ||
|
||
To reduce the amount of restarts required buildbarn needs a construct that | ||
describes how to reload the topology. | ||
|
||
### RemotelyDefinedBlobAccessConfiguration | ||
|
||
``` | ||
message RemotelyDefinedBlobAccessConfiguration { | ||
// Fetch the blob access configuration from an external service | ||
buildbarn.configuration.grpc.ClientConfiguration endpoint = 1; | ||
// Duration to reuse the previous configuration when not able to reach | ||
// RemoteBlobAccessConfiguration.GetBlobAccessConfiguration | ||
// before the component should consider itself partitioned from the cluster | ||
// and return UNAVAILABLE for any access requests. | ||
// | ||
// Recommended value: 10s | ||
google.protobuf.Duration remote_configuration_timeout = 2; | ||
} | ||
``` | ||
|
||
This configuration calls service that implements the following: | ||
|
||
``` | ||
service RemoteBlobAccessConfiguration { | ||
// A request to subscribe to the blob access configurations. | ||
// The service should immediately return the current blob access | ||
// configuration. After the initial blob access configuration the stream | ||
// will push out any changes to the blob access configuration. | ||
rpc GetBlobAccessConfiguration(GetBlobAccessConfigurationRequest) returns (stream BlobAccessConfiguration); | ||
} | ||
message GetBlobAccessConfigurationRequest { | ||
enum StorageBackend { | ||
CAS = 0; | ||
AC = 1; | ||
ICAS = 2; | ||
ISCC = 3; | ||
FSAC = 4; | ||
} | ||
// An implementation defined identifier describing who the client is. The | ||
// remote service may take into consideration when returning the | ||
// BlobAccessConfiguration. This is typically used when clients are in | ||
// different networks and should route differently. | ||
google.protobuf.Value identifier = 1; | ||
// Which storage backend that the service should describe the topology for. | ||
StorageBackend storage_backend = 2; | ||
} | ||
``` | ||
|
||
A simple implementation of this service could be a sidecar container that | ||
dynamically reads a configmap. | ||
|
||
A more complex implementation might: | ||
* Read prometheus metrics and roll out updates to the correct mirror. | ||
* Increase the number of shards when the metrics indicate that the retention | ||
time has fallen below a desired value. | ||
* Stagger read fallback configurations which are removed automatically after | ||
sufficient amount of time has passed. | ||
|
||
### JsonnetDefinedBlobAccessConfiguration | ||
|
||
``` | ||
message JsonnetDefinedBlobAccessConfiguration { | ||
// Path to the jsonnet file which describes the blob access configuration | ||
// to use. This will be monitored for changes. | ||
string path = 1; | ||
// External variables to use when parsing the jsonnet file. These are used | ||
// in addition to the process environment variables which are always used. | ||
map<string, string> external_variables = 2; | ||
// Methodology to use for subscribing to configuration file changes | ||
JsonnetConfigurationSubscriptionMethod subscription_method = 3; | ||
// An optional path to a lockfile which buildbarn will check the presense | ||
// for before before parsing the jsonnet file. Buildbarn will only parse | ||
// the jsonnet file if the lockfile does not exist. | ||
// | ||
// The lockfile is there to prevent unintentional partial updates to the | ||
// BlobAccessConfiguration. Such as when having a jsonnet file which | ||
// imports a libsonnet file where each file gets updated separately. | ||
string lock_file_path = 4; | ||
} | ||
message JsonnetConfigurationSubscriptionMethod { | ||
oneof method { | ||
// Parse the jsonnet file continously to monitor for any changes | ||
google.protobuf.Duration polling_duration = 1; | ||
// Subscribe to the jsonnet file and any imported libsonnet files via | ||
// the inotify api. | ||
google.protobuf.Empty inotify = 2; | ||
} | ||
} | ||
``` |