# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is today a multi-step process
consisting of the following steps:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without a fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step is
backwards compatible with the previous step. That is, accessing the blobstore
with the topology from step N-1 will resolve correctly even if some components
are already using the topology from step N.

The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. This process
might span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve
it is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology
in your new topology. However, this risks significantly reducing your
retention time, since data must be stored according to the addressing schemas
of both the new and the old topology simultaneously.

While it is possible to reduce the amount of address space that is resharded
by using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when they encounter errors,
losing ongoing work may be the preferable trade-off. Nonetheless, for clusters
with very expensive long-running actions, this could result in significant lost
work.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies have only a small overlap between
their addressing schemas, even if they share nodes. This can be significantly
improved by using a different sharding algorithm.

For this purpose, we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has
low overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:
* Removing a shard is _guaranteed_ to only require resharding for the blobs
that resolved to the removed shard.
* Adding a shard will reshard any blob to the new shard with a probability of
`weight/total_weight`.

This effectively means adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends.

```
message ShardingBlobAccessConfiguration {
  message Shard {
    // unchanged
    BlobAccessConfiguration backend = 1;
    // unchanged
    uint32 weight = 2;
  }
  // unchanged
  uint64 hash_initialization = 1;
  // Was 'shards', an array of shards to use; it has been replaced by
  // 'shard_map'.
  reserved 2;
  // NEW:
  // Shards identified by a key within the context of this sharding
  // configuration. Shards are chosen via Rendezvous hashing based on the
  // digest, weight, key and hash_initialization of the configuration.
  //
  // When removing a shard from the map, it is guaranteed that only blobs
  // which resolved to the removed shard will resolve to a different shard.
  // When adding shards, there is a weight/total_weight probability that any
  // given blob will resolve to one of the new shards.
  map<string, Shard> shard_map = 3;
}
```

Other algorithms that were considered include [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).
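To make the weighted selection concrete, the sketch below shows Rendezvous
hashing in Go. It is a simplified, hypothetical illustration rather than the
actual implementation: the FNV hash, the shard names, the example digest and
the omission of `hash_initialization` are choices made for brevity.

```
// Minimal, illustrative sketch of weighted Rendezvous hashing. The shard with
// the highest weighted score for a given digest wins.
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

type shard struct {
	key    string
	weight float64
}

// pickShard returns the key of the shard with the highest weighted
// Rendezvous score for the given blob digest.
func pickShard(shards []shard, digest string) string {
	bestKey, bestScore := "", math.Inf(-1)
	for _, s := range shards {
		h := fnv.New64a()
		h.Write([]byte(s.key))
		h.Write([]byte(digest))
		// Map the hash onto (0, 1) and apply the weight using the
		// standard weighted Rendezvous formula: weight / -log(u).
		u := (float64(h.Sum64()) + 1) / (float64(math.MaxUint64) + 2)
		score := s.weight / -math.Log(u)
		if score > bestScore {
			bestKey, bestScore = s.key, score
		}
	}
	return bestKey
}

func main() {
	shards := []shard{{"storage-0", 1}, {"storage-1", 1}, {"storage-2", 2}}
	fmt.Println(pickShard(shards, "8b1a9953c4611296a827abf8c47804d7-11"))
}
```

Because each shard's score depends only on that shard's key and the digest,
removing a shard cannot change the scores of the remaining shards, which is
what yields the guarantee listed above.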

## Handling Topology Changes Dynamically

To reduce the number of restarts required, a new blob access configuration is
added.

```
message RemotelyDefinedBlobAccessConfiguration {
  // Fetch the blob access configuration from an external service.
  buildbarn.configuration.grpc.ClientConfiguration grpc = 1;
  // Maximum grace time when receiving a new blob access configuration
  // before cancelling ongoing requests.
  //
  // Recommended value: 0s for client facing systems, 60s for internal
  // systems.
  google.protobuf.Duration maximum_grace_time = 2;
}
```

This configuration calls a service that implements the following:

```
service RemoteBlobAccessConfiguration {
  rpc Synchronize(SynchronizeRequest) returns (SynchronizeResponse);
}

message SynchronizeRequest {
  enum StorageBackend {
    CAS = 0;
    AC = 1;
    ICAS = 2;
    ISCC = 3;
    FSAC = 4;
  }
  // An implementation-defined identifier describing who the client is,
  // which the remote service may take into consideration when returning
  // the BlobAccessConfiguration. This is typically used when clients are
  // in different networks and should route differently.
  string identifier = 1;
  // The storage backend for which the service should describe the topology.
  StorageBackend storage_backend = 2;
  // A message describing the current state from the perspective of the
  // client. The client may assume its current topology is correct if the
  // service does not respond with a request to change it.
  BlobAccessConfiguration current = 3;
}

message SynchronizeResponse {
  // A message describing the blob access configuration the component
  // should use.
  BlobAccessConfiguration desired_state = 1;
  // Latest time at which an ongoing request to the previous blob access
  // configuration should be allowed to continue before the caller should
  // cancel it and return an error. The client may cancel it before the
  // grace_time has passed.
  google.protobuf.Timestamp grace_time = 2;
  // Timestamp for when this state is considered expired. The remote blob
  // access configuration service will avoid breaking compatibility until
  // after this timestamp has passed.
  google.protobuf.Timestamp response_expiry = 3;
}
```

A simple implementation of this service could be a sidecar container that
dynamically reads a ConfigMap.
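A hypothetical sketch of such a sidecar's decision logic is shown below. The
struct stands in for code that would be generated from the proto definitions
above, and the 60 second grace and one hour expiry are arbitrary example
values.

```
// Hypothetical sketch only: the types below stand in for code that would be
// generated from the SynchronizeResponse message defined above.
package topology

import (
	"os"
	"time"
)

// synchronizeResponse mirrors the proposed SynchronizeResponse message.
type synchronizeResponse struct {
	desiredState   []byte    // serialized BlobAccessConfiguration
	graceTime      time.Time // deadline for requests against the old topology
	responseExpiry time.Time // compatibility is kept until this time
}

// configMapService serves the topology from a file mounted from a ConfigMap.
type configMapService struct {
	path string
}

// synchronize returns the currently mounted topology. If it differs from the
// client's current configuration, the client is granted 60 seconds of grace
// to finish ongoing requests, and the response stays valid for one hour.
func (s *configMapService) synchronize(current []byte) (*synchronizeResponse, error) {
	desired, err := os.ReadFile(s.path)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	resp := &synchronizeResponse{
		desiredState:   desired,
		responseExpiry: now.Add(time.Hour),
	}
	if string(desired) != string(current) {
		resp.graceTime = now.Add(60 * time.Second)
	}
	return resp, nil
}
```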

A more complex implementation might:
* Read Prometheus metrics and roll out updates to the correct mirror.
* Increase the number of shards when the metrics indicate that the retention
time has fallen below a desired value.
* Stagger read fallback configurations that are removed automatically after a
  sufficient amount of time has passed.

Adding grace times and response expiries allows the service to set an upper
bound on when a well-behaved system should have propagated the topology change.
This allows it to make informed decisions about when it is safe to break
compatibility.
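On the component side, a natural way to combine the service-provided
`grace_time` with the locally configured `maximum_grace_time` from
`RemotelyDefinedBlobAccessConfiguration` is to cancel requests against the old
topology at the earlier of the two deadlines. A minimal sketch, with
hypothetical names:

```
// Hypothetical helper: bounds how long requests against the previous
// topology may keep running after a new one has been received.
package topology

import (
	"context"
	"time"
)

// graceContext derives a context that expires at the earlier of the
// grace_time returned by the service and now + maximum_grace_time from the
// local configuration.
func graceContext(parent context.Context, graceTime time.Time, maximumGraceTime time.Duration) (context.Context, context.CancelFunc) {
	deadline := time.Now().Add(maximumGraceTime)
	if graceTime.Before(deadline) {
		deadline = graceTime
	}
	return context.WithDeadline(parent, deadline)
}
```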
