This repository has been archived by the owner on Aug 28, 2020. It is now read-only.

Allow Fetch to match on a subset of qualifiers from a Push'd asset #5

Open
tomcoldrick-ct opened this issue Aug 11, 2020 · 0 comments

Comments

@tomcoldrick-ct
Collaborator

Currently, a Fetch request must match every qualifier of a previously Pushed blob for that blob to be found. This doesn't line up with how BuildStream plans to use the Remote Asset API. https://gitlab.com/BuildStream/buildstream/-/issues/1274 suggests that BuildStream will use different sets of qualifiers for Push and Fetch requests, with the Fetch qualifiers being a subset of those Pushed. As a result, Push'd sources will never be Fetch'd by BuildStream.
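To make the mismatch concrete, here is a minimal Python sketch. It assumes the lookup key is a hash over the URI plus every qualifier; `asset_key`, the in-memory `store`, the URI and the qualifier values are all hypothetical and only stand in for the real implementation.

```python
import hashlib


def asset_key(uri, qualifiers):
    """Hypothetical key: SHA-256 over the URI and *every* qualifier."""
    h = hashlib.sha256(uri.encode())
    # Sort so that the key does not depend on qualifier order.
    for name, value in sorted(qualifiers.items()):
        h.update(b"\0" + name.encode() + b"\0" + value.encode())
    return h.hexdigest()


uri = "https://example.com/repo.git"  # illustrative URI
pushed = {
    "resource_type": "application/x-git",
    "vcs.branch": "master",
    "vcs.commit": "abc123",
}
# The Fetch carries only a subset of the Pushed qualifiers.
fetched = {"resource_type": "application/x-git", "vcs.commit": "abc123"}

# A plain dict stands in for the key/value store behind the service.
store = {asset_key(uri, pushed): "digest-of-pushed-blob"}

exact_hit = store.get(asset_key(uri, pushed))    # same qualifiers: found
subset_hit = store.get(asset_key(uri, fetched))  # subset: different key, miss
```

Because the subset produces a different hash, the second lookup misses even though every qualifier it carries agrees with the Pushed asset.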

Our implementation currently uses a list of all qualifiers as part of the inputs to a hash, which is in turn used as the key for a BlobAccess. As a result, matching on just a subset of qualifiers is a non-trivial change. I'll outline a few possibilities to square this circle that have come to mind:

1. Allow a fetcher to set which qualifiers it deems important enough to hash.

As BuildStream presumably (from the discussion linked) plans to have the qualifiers it Fetches with well-defined on a per-source-type basis, we may be able to add some API to specify which qualifiers a given fetcher finds important. Given that we'll likely need a whole bunch of different fetchers adapted to specific fetching cases anyway, this isn't so bad, although it does potentially focus on this use case to the detriment of generality.

We could expose this as configuration for server operators to maintain, at the risk of ballooning configuration.
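A rough sketch of what option 1 could look like, in Python for brevity. The `HASHED_QUALIFIERS` table, the `"git"` fetcher name and `filtered_key` are assumptions made up for illustration, not an existing API or configuration format.

```python
import hashlib

# Hypothetical operator configuration: for each fetcher, the qualifiers
# that are considered important enough to take part in the hash.
HASHED_QUALIFIERS = {
    "git": frozenset({"resource_type", "vcs.commit"}),
}


def filtered_key(uri, qualifiers, fetcher):
    """Hash the URI plus only the qualifiers the given fetcher cares about."""
    important = HASHED_QUALIFIERS[fetcher]
    h = hashlib.sha256(uri.encode())
    for name, value in sorted(qualifiers.items()):
        if name in important:  # every other qualifier is ignored
            h.update(b"\0" + name.encode() + b"\0" + value.encode())
    return h.hexdigest()


uri = "https://example.com/repo.git"  # illustrative URI
push_key = filtered_key(
    uri,
    {"resource_type": "application/x-git", "vcs.branch": "master", "vcs.commit": "abc123"},
    "git",
)
fetch_key = filtered_key(
    uri,
    {"resource_type": "application/x-git", "vcs.commit": "abc123"},
    "git",
)
other_key = filtered_key(
    uri,
    {"resource_type": "application/x-git", "vcs.commit": "def456"},
    "git",
)
```

Here `push_key` and `fetch_key` are equal because the extra `vcs.branch` qualifier never reaches the hash, while `other_key` differs because an important qualifier changed. The cost is that a Fetch omitting an important qualifier still misses, and the table has to be maintained per fetcher.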

2. Hash only the URI, and mitigate collisions.

This way we retrieve assets based only on the URI, and add additional logic to handle matching the qualifiers. This will keep things general, and mean that we match qualifiers consistently in all cases. However, it will cost performance, since every lookup by URI then has to be followed by a qualifier comparison.

To mitigate collisions, I see two possibilities: use a form of cuckoo hashing, or modify the AssetStore to store a list of Assets corresponding to the URI, along with their qualifiers. Cuckoo hashing has the benefit of allowing us to expire references automatically as part of the Put, but it is more complex and means more I/O, as we have to read from the underlying blob store more often. Making the AssetStore take a list of Assets will allow us to load only once, but may cause the entries to grow a lot in size, and will also mean that we have to modify the content stored in the blob store under a single digest.
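The list-of-Assets variant could be sketched roughly as follows. This is Python pseudocode for the idea rather than the real AssetStore: `ListAssetStore`, `AssetEntry` and the "most recently pushed match wins" rule are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class AssetEntry:
    qualifiers: dict
    digest: str


class ListAssetStore:
    """Hypothetical store keyed by URI hash, holding a list of assets per key."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(uri):
        # Only the URI is hashed; qualifiers are compared after the load.
        return hashlib.sha256(uri.encode()).hexdigest()

    def push(self, uri, qualifiers, digest):
        entries = self._entries.setdefault(self._key(uri), [])
        # Replace any entry pushed earlier with exactly the same qualifiers.
        entries[:] = [e for e in entries if e.qualifiers != qualifiers]
        entries.append(AssetEntry(dict(qualifiers), digest))

    def fetch(self, uri, qualifiers):
        # A single load returns every candidate for this URI. A candidate
        # matches when all requested qualifiers appear in it with the same
        # value; the most recently pushed match wins (an assumption).
        for entry in reversed(self._entries.get(self._key(uri), [])):
            if all(entry.qualifiers.get(n) == v for n, v in qualifiers.items()):
                return entry.digest
        return None


store = ListAssetStore()
uri = "https://example.com/repo.git"  # illustrative URI
store.push(uri, {"vcs.branch": "master", "vcs.commit": "abc123"}, "digest-1")
store.push(uri, {"vcs.branch": "dev", "vcs.commit": "def456"}, "digest-2")

subset_hit = store.fetch(uri, {"vcs.commit": "abc123"})
conflict = store.fetch(uri, {"vcs.branch": "master", "vcs.commit": "def456"})
```

The sketch also shows the trade-off mentioned above: every Push rewrites the whole list stored under the URI's key, and that list grows with each distinct set of qualifiers.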

Of these possibilities, I'm leaning towards extending the AssetStore to store a list of Assets.
