Add a new ADR for resharding improvements #5
base: master
Conversation
The specifics we might want to refine during implementation are what exactly the call should return. Returning an entire BlobAccessConfiguration is convenient and ultimately very generic, but it does mean a bit of mixed responsibility for the server implementation, since it communicates both the topology information (i.e. which shards are where) and how to access them (such as gRPC keepalive, TLS, etc.). We might want the synchronize call to return an object that only describes the topology and in some manner parameterize the transport settings locally. This could also be an implementation-defined secondary layer: a central service responsible only for topology information, with a sidecar container implementing the actual configuration service.
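For illustration, here is a minimal Go sketch of what a topology-only result could look like, with transport settings kept local to each component. All names here are hypothetical; none of them exist in Buildbarn today.

```go
// Hypothetical sketch of a topology-only synchronize result. It only says
// which shards exist and how the key space is divided, deliberately carrying
// no gRPC/TLS transport settings.
package topology

// ShardTopology describes the layout of the sharded storage.
type ShardTopology struct {
	// Shards maps a stable shard identifier to its location and weight.
	Shards map[string]Shard
}

// Shard is a single storage backend in the topology.
type Shard struct {
	Address string // host:port, resolved by the consuming component.
	Weight  uint32 // Relative share of the key space.
}

// TransportParameters would be configured locally on each component and
// combined with the topology to build the final blob access configuration.
type TransportParameters struct {
	TLSEnabled       bool
	KeepaliveSeconds int
}
```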
is a trade-off between the speed of rollout and the potential loss of ongoing work. Since clients automatically retry when encountering errors, losing ongoing work may be the preferred issue to address. Nonetheless, for clusters with very expensive long-running actions, this could result in significant work loss.
Worth calling out that it is possible to reduce the number of processes that need to be restarted by pointing all other components to the frontend processes (e.g., workers, the scheduler, bb-browser).
I can add it to the text, but I don't think it's a good pattern to encourage for workers.
Firstly, frontends already need significant capacity even without proxying the worker traffic. Secondly, when workers have the topology they can restart gracefully, but when they proxy via the frontends they will instead randomly fail whichever actions happened to be accessing storage at the time of the restart.
One more use case for resharding without restart: bb-clientd instances connected directly to the shards without a bb-storage proxy frontend in between, because restarting client-side bb-clientd instances is typically challenging for operators of the server-side services.
0011-resharding-without-downtime.md
Outdated
string identifier = 1;

// Which storage backend that the service should describe the topology for.
StorageBackend storage_backend = 2;
Is it really relevant to provide this? If a user needs to provide different configurations for individual storage types, why not encode that as part of the identifier?
That was my initial design, but as we are returning an entire BlobAccessConfiguration, it will always differ between the non-scannable and the scannable blob access configurations. I thought making storage_backend a first-class property, while leaving the identifier intact for other uses, was preferable.
If we go down the other route that I described as a comment in the PR (returning a description of the actual topology and constructing the BlobAccessConfiguration from that), then I don't see any reason for having storage_backend as a first-class member of the struct.
A simple implementation of this service could be a sidecar container that dynamically reads a configmap.
This makes me wonder: why have a gRPC service for this? Why can't Buildbarn binaries monitor a config map on disk and reload it as needed?
That is also a possible implementation when running in a Kubernetes context. Instead of RemotelyDefinedBlobAccessConfiguration we could have JsonnetBlobAccessConfiguration, which takes a path to a Jsonnet file that resolves to a BlobAccessConfiguration, and the service component would communicate by writing directly to the ConfigMap.
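As a rough illustration of that alternative, here is a sketch of the file-evaluation half, assuming the github.com/google/go-jsonnet library and an example file path; JsonnetBlobAccessConfiguration itself does not exist yet.

```go
// Hypothetical sketch: evaluate a Jsonnet file (e.g. projected from a
// ConfigMap volume) into JSON that could then be unmarshaled into a
// BlobAccessConfiguration message.
package main

import (
	"fmt"
	"log"

	"github.com/google/go-jsonnet"
)

func main() {
	vm := jsonnet.MakeVM()
	// Example path only; in Kubernetes this would typically be a file
	// mounted from a ConfigMap.
	jsonStr, err := vm.EvaluateFile("/config/blob_access.jsonnet")
	if err != nil {
		log.Fatalf("Failed to evaluate Jsonnet: %v", err)
	}
	// A real component would unmarshal this JSON into the
	// BlobAccessConfiguration protobuf message and apply it.
	fmt.Println(jsonStr)
}
```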
Have you considered retrieving jsonnet files via HTTP, to allow using standard web servers instead of a special gRPC service?
I understand the benefit of gRPC if using a gRPC stream to push configuration updates, but on the other hand that involves keeping state about clients on the server side, which increases complexity.
I find the use of Jsonnet files via HTTP appealing, especially when combined with a TTL expiration header in the HTTP responses.
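A small sketch of what the HTTP variant could look like, assuming a hypothetical URL and using the standard Cache-Control max-age header as the TTL; nothing here corresponds to an existing Buildbarn configuration option.

```go
// Hypothetical sketch: fetch a Jsonnet (or JSON) configuration over HTTP and
// derive a refresh interval from the Cache-Control max-age response header.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

var maxAgeRe = regexp.MustCompile(`max-age=(\d+)`)

func fetchConfiguration(url string) (body []byte, ttl time.Duration, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return nil, 0, err
	}
	// Default refresh interval if the server does not provide a TTL.
	ttl = time.Minute
	if m := maxAgeRe.FindStringSubmatch(resp.Header.Get("Cache-Control")); m != nil {
		if seconds, err := strconv.Atoi(m[1]); err == nil {
			ttl = time.Duration(seconds) * time.Second
		}
	}
	return body, ttl, nil
}

func main() {
	// Example URL only; any standard web server could serve this file.
	body, ttl, err := fetchConfiguration("http://config.example/storage.jsonnet")
	if err != nil {
		log.Fatalf("Failed to fetch configuration: %v", err)
	}
	fmt.Printf("Fetched %d bytes, refresh again in %v\n", len(body), ttl)
}
```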
Force-pushed from 0cadab8 to 775c8da
The latest commit contains modifications to the
Other algorithms considered were: [Consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and [Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).
This could probably use some discussion of why those algorithms were rejected.
Ack
0011-resharding-without-downtime.md
Outdated
// remote service may take into consideration when returning the
// BlobAccessConfiguration. This is typically used when clients are in
// different networks and should route differently.
google.protobuf.Value identifier = 1;
What is the concrete use case for this? Why can't the server derive the client's identity from RPC credentials?
Ack, we could derive the identity from the RPC credentials instead.
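For reference, a sketch of how a Go gRPC server can read the caller's TLS identity from the request context; the helper name is made up, only the grpc peer and credentials packages are real.

```go
// Hypothetical sketch: derive the calling component's identity from its TLS
// client certificate instead of a client-supplied identifier field.
package identity

import (
	"context"
	"errors"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// clientIdentityFromContext returns the Common Name of the client
// certificate presented on the connection, if any.
func clientIdentityFromContext(ctx context.Context) (string, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return "", errors.New("no peer information in context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok || len(tlsInfo.State.PeerCertificates) == 0 {
		return "", errors.New("no verified client certificate")
	}
	return tlsInfo.State.PeerCertificates[0].Subject.CommonName, nil
}
```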
google.protobuf.Empty inotify = 2;
  }
}
This could probably use some discussion of how actually swapping out the BlobAccess under the hood is going to work. LocalBlobAccess and ZipWritingBlobAccess have logic for at least flushing data upon shutdown, but that's it. This means that every new config push would end up leaking gRPC clients and file descriptors. pkg/blockdevice also doesn't allow memory unmapping.
I don't think I'm equipped to answer that question until we've had a look at what performing the implementation would entail.
Here are some unordered thoughts to look into (a rough sketch of one possible building block follows below):
From a consumer point of view it would be neat if Buildbarn handled all of the possible issues that could arise from reloading a configuration. From a maintenance point of view this might not be ideal or even warranted.
The key features we would want to reload are the topological pieces of the blob access configuration (e.g. mirrored, sharding, readfallback, grpc and probably others). Those would then need implementation work to not leak or have other issues. Several other configurations might not even make sense to reload. We would still want everything to compose neatly, but there is no point in partially reloading a local blob access configuration.
Reuse subsets of the graph which we determine to be identical?
Have an interface for reloading the configuration which may error out (for when reloading is requested in an incompatible manner)?
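To make the discussion a bit more concrete, here is a rough sketch of a swappable wrapper. The BlobAccess interface below is a deliberately trimmed stand-in for Buildbarn's real one, and teardown of the previous backend (closing gRPC clients, unmapping block devices, waiting for in-flight calls) is exactly the hard part that is left out.

```go
// Hypothetical sketch: a wrapper whose underlying backend can be swapped
// atomically when a new configuration arrives.
package reloading

import (
	"context"
	"sync/atomic"
)

// BlobAccess is a minimal stand-in for Buildbarn's real interface.
type BlobAccess interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, data []byte) error
}

// SwappableBlobAccess forwards calls to whichever backend is currently
// installed. Swapping is lock-free for readers; cleaning up the previous
// backend safely is intentionally not addressed here.
type SwappableBlobAccess struct {
	current atomic.Pointer[BlobAccess]
}

func NewSwappableBlobAccess(initial BlobAccess) *SwappableBlobAccess {
	s := &SwappableBlobAccess{}
	s.current.Store(&initial)
	return s
}

// Swap installs a new backend and returns the previous one, so the caller
// can decide how (and when) to tear it down.
func (s *SwappableBlobAccess) Swap(next BlobAccess) BlobAccess {
	return *s.current.Swap(&next)
}

func (s *SwappableBlobAccess) Get(ctx context.Context, key string) ([]byte, error) {
	return (*s.current.Load()).Get(ctx, key)
}

func (s *SwappableBlobAccess) Put(ctx context.Context, key string, data []byte) error {
	return (*s.current.Load()).Put(ctx, key, data)
}
```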
My organization has a Buildbarn fork with custom BlobAccess implementations for our custom sharding algorithm, with support for reloading.
The only upstream blob access implementation we reload is the gRPC one, and for that we rely on pkg/grpc/deduplicating_client_factory to reuse gRPC clients. I believe we have nothing in place to avoid leaking gRPC clients for endpoints that are no longer used, but that is acceptable for our current resharding use cases.
Our fork introduces a global mutex that is locked while the main routine initializes blob access configurations during startup, and also locked when configuration refreshing takes place. I'm not sure if that mutex is really needed, but I have not dared to remove it because it is hard to grasp all potential data races that could otherwise occur.
If you add support for reloading configurations, that would be very helpful and would probably allow us to simplify our custom patches! And we could someday publish them if there is interest from the open-source community.
0011-resharding-without-downtime.md
Outdated
map<string, string> external_variables = 2;

// Methodology to use for subscribing to configuration file changes
JsonnetConfigurationSubscriptionMethod subscription_method = 3;
Why does a user need to provide this? Why can't we do the right thing by default?
For TLS certificates we also have some logic to reload configs from disk. Is there an opportunity to have common logic for this kind of stuff?
I think the inotify behavior of subscribing to file events would be preferable. The secondary method is there in case we for some reason could not rely on the inotify events. I don't think there is a way to verify ahead of time that inotify events will trigger, but perhaps it would be better if we ran both instead of having the user choose?
We would then use inotify to act immediately, with a fallback of reading the file periodically in case the inotify events are not available (for whatever reason). When performing the implementation work we should at least see if we can share some of the logic here with the logic for reloading TLS certificates.
Why not both? Use inotify, but also do a forced read every 5 minutes. That gives the best out-of-the-box experience.
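A sketch of running both mechanisms, assuming the github.com/fsnotify/fsnotify library; the path, interval and reload callback are placeholders.

```go
// Hypothetical sketch: reload on file events, but also force a reload on a
// fixed interval in case events are not delivered.
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

func watchConfiguration(path string, interval time.Duration, reload func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()
	if err := watcher.Add(path); err != nil {
		return err
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case event := <-watcher.Events:
			// ConfigMap updates in Kubernetes show up as symlink swaps, so
			// react to any event on the watched path.
			log.Printf("File event %s, reloading", event)
			reload()
		case err := <-watcher.Errors:
			log.Printf("Watcher error: %v", err)
		case <-ticker.C:
			// Periodic fallback read, e.g. every 5 minutes.
			reload()
		}
	}
}

func main() {
	_ = watchConfiguration("/config/storage.jsonnet", 5*time.Minute, func() {
		log.Print("Reloading configuration")
	})
}
```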
Force-pushed from 775c8da to 3e7d80c
// Duration to reuse the previous configuration when not able to reach
// RemoteBlobAccessConfiguration.GetBlobAccessConfiguration
// before the component should consider itself partitioned from the cluster
// and return UNAVAILABLE for any access requests.
//
// Recommended value: 10s
google.protobuf.Duration remote_configuration_timeout = 2;
Have you considered supporting a time-to-live (TTL) value in the fetched configuration, telling how many seconds that particular retrieved configuration is valid? Similar to TTL in DNS records. That would allow changing the refresh interval without restarting all components. Operators could, for example, freely choose a propagation duration between steps 2 and 3 (in the 'Context' chapter) for each resharding occasion, by first changing the refresh interval (for scenarios where unnecessary client retries are undesired).
If the configuration is retrieved via HTTP, then a TTL expiry value could be specified in an HTTP response header.
I would prefer a solution that doesn't require a shared filesystem among all components, for example when not running on Kubernetes or when consuming a sharding configuration directly from bb-clientd. Additionally, I suspect that most shared filesystem solutions (e.g., NFS, CephFS) do not support inotify, correct?
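As an illustration of how a per-response TTL and a hard staleness cutoff like remote_configuration_timeout could combine on the consuming side, here is a sketch with made-up names.

```go
// Hypothetical sketch: keep serving the last fetched configuration until a
// staleness cutoff is reached, then fail with UNAVAILABLE.
package configcache

import (
	"sync"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type CachedConfiguration struct {
	mu        sync.Mutex
	payload   []byte
	fetchedAt time.Time
	// maxStaleness plays the role of remote_configuration_timeout: how long
	// a configuration may keep being used after the last successful fetch.
	maxStaleness time.Duration
}

func NewCachedConfiguration(maxStaleness time.Duration) *CachedConfiguration {
	return &CachedConfiguration{maxStaleness: maxStaleness}
}

// Update is called whenever a fetch succeeds (e.g. on each TTL-driven refresh).
func (c *CachedConfiguration) Update(payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.payload = payload
	c.fetchedAt = time.Now()
}

// Get returns the cached configuration, or UNAVAILABLE if the component
// should consider itself partitioned from the configuration service.
func (c *CachedConfiguration) Get() ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.payload == nil || time.Since(c.fetchedAt) > c.maxStaleness {
		return nil, status.Error(codes.Unavailable, "Configuration is stale or missing")
	}
	return c.payload, nil
}
```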