Add a new ADR for resharding improvements #5
base: master
Conversation
The specifics we might want to refine during implementation are what exactly the call should return. Returning an entire BlobAccessConfiguration is convenient and ultimately very generic, but it does mean a bit of mixed responsibility for the server implementation, since it communicates both the topology information (i.e. which shards are where) and how to access them (such as gRPC keepalive, TLS, etc.). We might want the synchronize call to return an object that only describes the topology and in some manner parameterize the transport settings locally. This could also be an implementation-defined secondary layer: a central service responsible only for topology information, with a sidecar container implementing the actual configuration service.
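For illustration, here is a minimal Go sketch of what a topology-only result could look like, with transport settings kept local to each component. All names here are hypothetical; none of them exist in Buildbarn today.

```go
// Hypothetical sketch of a topology-only synchronize result. It only says
// which shards exist and how the key space is divided, deliberately carrying
// no gRPC/TLS transport settings.
package topology

// ShardTopology describes the layout of the sharded storage.
type ShardTopology struct {
	// Shards maps a stable shard identifier to its location and weight.
	Shards map[string]Shard
}

// Shard is a single storage backend in the topology.
type Shard struct {
	Address string // host:port, resolved by the consuming component.
	Weight  uint32 // Relative share of the key space.
}

// TransportParameters would be configured locally on each component and
// combined with the topology to build the final blob access configuration.
type TransportParameters struct {
	TLSEnabled       bool
	KeepaliveSeconds int
}
```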
is a trade-off between the speed of rollout and the potential loss of ongoing work. Since clients automatically retry when encountering errors, losing ongoing work may be the preferred issue to address. Nonetheless, for clusters with very expensive long-running actions, this could result in significant work loss.
Worth calling out that it is possible to reduce the number of processes that need to be restarted by pointing all other components to the frontend processes (e.g., workers, the scheduler, bb-browser).
I can add it to the text, but I don't think it's a good pattern to encourage for workers.
Firstly, frontends already need significant capacity even without proxying the worker traffic. Secondly, when workers have the topology they can restart gracefully, but when they proxy via the frontends they will instead randomly fail whichever actions happened to be accessing storage at the time of the restart.
One more use case for resharding without restart: bb-clientd instances connected directly to the shards without a bb-storage proxy frontend in between, because restarting client-side bb-clientd instances is typically challenging for operators of the server-side services.
0011-resharding-without-downtime.md
Outdated
string identifier = 1;

// Which storage backend that the service should describe the topology for.
StorageBackend storage_backend = 2;
Is it really relevant to provide this? If a user needs to provide different configurations for individual storage types, why not encode that as part of the identifier?
That was my initial design, but as we are returning an entire BlobAccessConfiguration, it will always differ between the non-scannable and the scannable blob access configurations. I thought making storage_backend a first-class property, while leaving the identifier intact for other uses, was preferable.
If we go down the other route that I described as a comment in the PR (returning a description of the actual topology and constructing the BlobAccessConfiguration from that), then I don't see any reason for having storage_backend as a first-class member of the struct.
A simple implementation of this service could be a sidecar container that dynamically reads a configmap.
This makes me wonder: why have a gRPC service for this? Why can't Buildbarn binaries monitor a config map on disk and reload it as needed?
That is also a possible implementation when running in a Kubernetes context. Instead of RemotelyDefinedBlobAccessConfiguration we could have JsonnetBlobAccessConfiguration, which takes a path to a Jsonnet file that resolves to a BlobAccessConfiguration, and the service component would communicate by writing directly to the ConfigMap.
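As a rough illustration of that alternative, here is a sketch of the file-evaluation half, assuming the github.com/google/go-jsonnet library and an example file path; JsonnetBlobAccessConfiguration itself does not exist yet.

```go
// Hypothetical sketch: evaluate a Jsonnet file (e.g. projected from a
// ConfigMap volume) into JSON that could then be unmarshaled into a
// BlobAccessConfiguration message.
package main

import (
	"fmt"
	"log"

	"github.com/google/go-jsonnet"
)

func main() {
	vm := jsonnet.MakeVM()
	// Example path only; in Kubernetes this would typically be a file
	// mounted from a ConfigMap.
	jsonStr, err := vm.EvaluateFile("/config/blob_access.jsonnet")
	if err != nil {
		log.Fatalf("Failed to evaluate Jsonnet: %v", err)
	}
	// A real component would unmarshal this JSON into the
	// BlobAccessConfiguration protobuf message and apply it.
	fmt.Println(jsonStr)
}
```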
Have you considered retrieving jsonnet files via HTTP, to allow using standard web servers instead of a special gRPC service?
I understand the benefit of gRPC if using a gRPC stream to push configuration updates, but on the other hand that involves keeping state about clients on the server side, which increases complexity.
I find the use of Jsonnet files via HTTP appealing, especially when combined with a TTL expiration header in the HTTP responses.
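A small sketch of what the HTTP variant could look like, assuming a hypothetical URL and using the standard Cache-Control max-age header as the TTL; nothing here corresponds to an existing Buildbarn configuration option.

```go
// Hypothetical sketch: fetch a Jsonnet (or JSON) configuration over HTTP and
// derive a refresh interval from the Cache-Control max-age response header.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

var maxAgeRe = regexp.MustCompile(`max-age=(\d+)`)

func fetchConfiguration(url string) (body []byte, ttl time.Duration, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return nil, 0, err
	}
	// Default refresh interval if the server does not provide a TTL.
	ttl = time.Minute
	if m := maxAgeRe.FindStringSubmatch(resp.Header.Get("Cache-Control")); m != nil {
		if seconds, err := strconv.Atoi(m[1]); err == nil {
			ttl = time.Duration(seconds) * time.Second
		}
	}
	return body, ttl, nil
}

func main() {
	// Example URL only; any standard web server could serve this file.
	body, ttl, err := fetchConfiguration("http://config.example/storage.jsonnet")
	if err != nil {
		log.Fatalf("Failed to fetch configuration: %v", err)
	}
	fmt.Printf("Fetched %d bytes, refresh again in %v\n", len(body), ttl)
}
```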
Force-pushed from 0cadab8 to 775c8da
The latest commit contains modifications to the
Other algorithms considered were: [Consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and [Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).
This could probably use some discussion of why those algorithms were rejected.
Ack
0011-resharding-without-downtime.md
Outdated
// remote service may take into consideration when returning the
// BlobAccessConfiguration. This is typically used when clients are in
// different networks and should route differently.
google.protobuf.Value identifier = 1;
What is the concrete use case for this? Why can't the server derive the client's identity from RPC credentials?
Ack, we could derive the identity from the RPC credentials instead.
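For reference, a sketch of how a Go gRPC server can read the caller's TLS identity from the request context; the helper name is made up, only the grpc peer and credentials packages are real.

```go
// Hypothetical sketch: derive the calling component's identity from its TLS
// client certificate instead of a client-supplied identifier field.
package identity

import (
	"context"
	"errors"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// clientIdentityFromContext returns the Common Name of the client
// certificate presented on the connection, if any.
func clientIdentityFromContext(ctx context.Context) (string, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return "", errors.New("no peer information in context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok || len(tlsInfo.State.PeerCertificates) == 0 {
		return "", errors.New("no verified client certificate")
	}
	return tlsInfo.State.PeerCertificates[0].Subject.CommonName, nil
}
```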
google.protobuf.Empty inotify = 2;
  }
}
This could probably use some discussion of how actually swapping out the BlobAccess under the hood is going to work. LocalBlobAccess and ZipWritingBlobAccess have logic for at least flushing data upon shutdown, but that's it. This means that every new config push would end up leaking gRPC clients and file descriptors. pkg/blockdevice also doesn't allow memory unmapping.
I don't think I'm equipped to answer that question until we've had a look at what performing the implementation would entail.
Here are some unordered thoughts to look into (a rough sketch of one possible building block follows below):
From a consumer point of view it would be neat if Buildbarn handled all of the possible issues that could arise from reloading a configuration. From a maintenance point of view this might not be ideal or even warranted.
The key features we would want to reload are the topological pieces of the blob access configuration (e.g. mirrored, sharding, readfallback, grpc and probably others). Those would then need implementation work to not leak or have other issues. Several other configurations might not even make sense to reload. We would still want everything to compose neatly, but there is no point in partially reloading a local blob access configuration.
Reuse subsets of the graph which we determine to be identical?
Have an interface for reloading the configuration which may error out (for when reloading is requested in an incompatible manner)?
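To make the discussion a bit more concrete, here is a rough sketch of a swappable wrapper. The BlobAccess interface below is a deliberately trimmed stand-in for Buildbarn's real one, and teardown of the previous backend (closing gRPC clients, unmapping block devices, waiting for in-flight calls) is exactly the hard part that is left out.

```go
// Hypothetical sketch: a wrapper whose underlying backend can be swapped
// atomically when a new configuration arrives.
package reloading

import (
	"context"
	"sync/atomic"
)

// BlobAccess is a minimal stand-in for Buildbarn's real interface.
type BlobAccess interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, data []byte) error
}

// SwappableBlobAccess forwards calls to whichever backend is currently
// installed. Swapping is lock-free for readers; cleaning up the previous
// backend safely is intentionally not addressed here.
type SwappableBlobAccess struct {
	current atomic.Pointer[BlobAccess]
}

func NewSwappableBlobAccess(initial BlobAccess) *SwappableBlobAccess {
	s := &SwappableBlobAccess{}
	s.current.Store(&initial)
	return s
}

// Swap installs a new backend and returns the previous one, so the caller
// can decide how (and when) to tear it down.
func (s *SwappableBlobAccess) Swap(next BlobAccess) BlobAccess {
	return *s.current.Swap(&next)
}

func (s *SwappableBlobAccess) Get(ctx context.Context, key string) ([]byte, error) {
	return (*s.current.Load()).Get(ctx, key)
}

func (s *SwappableBlobAccess) Put(ctx context.Context, key string, data []byte) error {
	return (*s.current.Load()).Put(ctx, key, data)
}
```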
My organization has a Buildbarn fork with custom BlobAccess implementations for our custom sharding algorithm, with support for reloading.
The only upstream blob access implementation we reload is the gRPC one, and for that we rely on pkg/grpc/deduplicating_client_factory to reuse gRPC clients. I believe we have nothing in place to avoid leaking gRPC clients for endpoints that are no longer used, but that is acceptable for our current resharding use cases.
Our fork introduces a global mutex that is locked while the main routine initializes blob access configurations during startup, and also locked when configuration refreshing takes place. I'm not sure if that mutex is really needed, but I have not dared to remove it because it is hard to grasp all potential data races that could otherwise occur.
If you add support for reloading configurations, that would be very helpful and would probably allow us to simplify our custom patches! And we could someday publish them if there is interest from the open-source community.
0011-resharding-without-downtime.md
Outdated
map<string, string> external_variables = 2;

// Methodology to use for subscribing to configuration file changes
JsonnetConfigurationSubscriptionMethod subscription_method = 3;
Why does a user need to provide this? Why can't we do the right thing by default?
For TLS certificates we also have some logic to reload configs from disk. Is there an opportunity to have common logic for this kind of stuff?
I think the inotify behavior of subscribing to file events would be preferable. The secondary method is there in case we for some reason could not rely on the inotify events. I don't think there is a way to verify ahead of time that inotify events will trigger, but perhaps it would be better if we ran both instead of having the user choose?
We would then use inotify to act immediately, with a fallback of reading the file periodically in case the inotify events are not available (for whatever reason). When performing the implementation work we should at least see if we can share some of the logic here with the logic for reloading TLS certificates.
Why not both? Use inotify, but also do a forced read every 5 minutes. That gives the best out-of-the-box experience.
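A sketch of running both mechanisms, assuming the github.com/fsnotify/fsnotify library; the path, interval and reload callback are placeholders.

```go
// Hypothetical sketch: reload on file events, but also force a reload on a
// fixed interval in case events are not delivered.
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

func watchConfiguration(path string, interval time.Duration, reload func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()
	if err := watcher.Add(path); err != nil {
		return err
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case event := <-watcher.Events:
			// ConfigMap updates in Kubernetes show up as symlink swaps, so
			// react to any event on the watched path.
			log.Printf("File event %s, reloading", event)
			reload()
		case err := <-watcher.Errors:
			log.Printf("Watcher error: %v", err)
		case <-ticker.C:
			// Periodic fallback read, e.g. every 5 minutes.
			reload()
		}
	}
}

func main() {
	_ = watchConfiguration("/config/storage.jsonnet", 5*time.Minute, func() {
		log.Print("Reloading configuration")
	})
}
```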
Force-pushed from 775c8da to 3e7d80c
// Duration to reuse the previous configuration when not able to reach
// RemoteBlobAccessConfiguration.GetBlobAccessConfiguration
// before the component should consider itself partitioned from the cluster
// and return UNAVAILABLE for any access requests.
//
// Recommended value: 10s
google.protobuf.Duration remote_configuration_timeout = 2;
Have you considered supporting a time-to-live (TTL) value in the fetched configuration, telling how many seconds that particular retrieved configuration is valid? Similar to TTL in DNS records. That would allow changing the refresh interval without restarting all components. Operators could, for example, freely choose a propagation duration between steps 2 and 3 (in the 'Context' chapter) for each resharding occasion, by first changing the refresh interval (for scenarios where unnecessary client retries are undesired).
If the configuration is retrieved via HTTP, then a TTL expiry value could be specified in an HTTP response header.
I would prefer a solution that doesn't require a shared filesystem among all components, for example when not running on Kubernetes or when consuming a sharding configuration directly from bb-clientd. Additionally, I suspect that most shared filesystem solutions (e.g., NFS, CephFS) do not support inotify, correct?
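As an illustration of how a per-response TTL and a hard staleness cutoff like remote_configuration_timeout could combine on the consuming side, here is a sketch with made-up names.

```go
// Hypothetical sketch: keep serving the last fetched configuration until a
// staleness cutoff is reached, then fail with UNAVAILABLE.
package configcache

import (
	"sync"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type CachedConfiguration struct {
	mu        sync.Mutex
	payload   []byte
	fetchedAt time.Time
	// maxStaleness plays the role of remote_configuration_timeout: how long
	// a configuration may keep being used after the last successful fetch.
	maxStaleness time.Duration
}

func NewCachedConfiguration(maxStaleness time.Duration) *CachedConfiguration {
	return &CachedConfiguration{maxStaleness: maxStaleness}
}

// Update is called whenever a fetch succeeds (e.g. on each TTL-driven refresh).
func (c *CachedConfiguration) Update(payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.payload = payload
	c.fetchedAt = time.Now()
}

// Get returns the cached configuration, or UNAVAILABLE if the component
// should consider itself partitioned from the configuration service.
func (c *CachedConfiguration) Get() ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.payload == nil || time.Since(c.fetchedAt) > c.maxStaleness {
		return nil, status.Error(codes.Unavailable, "Configuration is stale or missing")
	}
	return c.payload, nil
}
```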