Support User-Defined Object Metadata #4754

tustvold · 2023-08-30T12:40:45Z

This is a draft proposal, and likely needs more polish

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Many stores provide the ability to associate arbitrary user-defined attributes with objects; it would be useful to expose this.

Describe the solution you'd like

I would like to propose a new put_opts call, in a similar vein to the existing get_opts. This would take a PutOptions

pub struct PutOptions {
    pub metadata: HashMap<String, String>
}

Stores that can't store metadata should return an error if passed metadata, and ObjectMeta should be updated to include such metadata.
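A minimal sketch of the behaviour described above, under loud assumptions: `Store`, `MemoryStore`, `NoMetadataStore` and `Error::NotSupported` are hypothetical names invented for illustration, and `put_opts` is simplified to a synchronous call taking `&str` and `Vec<u8>` rather than the crate's async signature.

```rust
use std::collections::HashMap;

/// Options for a put request, as proposed in this ticket.
#[derive(Debug, Default, Clone)]
pub struct PutOptions {
    pub metadata: HashMap<String, String>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Error {
    NotSupported(&'static str),
}

/// Simplified, synchronous stand-in for the store trait (assumption).
pub trait Store {
    fn put_opts(&mut self, location: &str, bytes: Vec<u8>, opts: PutOptions) -> Result<(), Error>;
}

/// A store that keeps metadata alongside the object bytes.
#[derive(Default)]
pub struct MemoryStore {
    pub objects: HashMap<String, (Vec<u8>, HashMap<String, String>)>,
}

impl Store for MemoryStore {
    fn put_opts(&mut self, location: &str, bytes: Vec<u8>, opts: PutOptions) -> Result<(), Error> {
        self.objects.insert(location.to_string(), (bytes, opts.metadata));
        Ok(())
    }
}

/// A store with nowhere to keep metadata: it rejects non-empty metadata.
#[derive(Default)]
pub struct NoMetadataStore {
    pub objects: HashMap<String, Vec<u8>>,
}

impl Store for NoMetadataStore {
    fn put_opts(&mut self, location: &str, bytes: Vec<u8>, opts: PutOptions) -> Result<(), Error> {
        if !opts.metadata.is_empty() {
            return Err(Error::NotSupported("user-defined metadata"));
        }
        self.objects.insert(location.to_string(), bytes);
        Ok(())
    }
}
```

In this sketch an empty `PutOptions` still succeeds on `NoMetadataStore`, so callers that never set metadata are unaffected.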

Unix systems can likely make use of xattr to store user metadata

We will likely need to restrict the key names in some manner

Describe alternatives you've considered

Additional context

#4498 also calls for some sort of put_opts style API

#4753 would benefit from this functionality


tustvold · 2023-10-10T08:45:28Z

A further wrinkle is that many of the listing APIs do not return this metadata

thinkharderdev · 2023-10-20T13:59:42Z

We need this somewhat urgently (can hack around it for now but would like to unhack it asap) so I can work on this.

tustvold · 2023-10-20T14:04:40Z

Can you perhaps expand on your use case? I'm not sure about the API as originally proposed by this ticket, and was considering instead providing a mechanism similar to what we provide for content type.

thinkharderdev · 2023-10-20T14:12:47Z

We need to read/write object tags from S3 (and soon other cloud providers). I was planning on spending some time looking at the relevant cloud provider APIs and seeing what a reasonable way to do this would be. I know with S3 at least it's a little bit annoying, as you can set tags in the PutObject calls but neither GetObject nor ListObjects return the tags.

tustvold · 2023-10-20T14:17:52Z

read/write objects tag

As in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html or metadata - https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html

They're separate things, and part of why I'm not sure about exposing this

We need to read/write objects tags from S3

Can you provide any context on why you need to read tags?
Are the tags you wish to write static, or do they vary based on request?
If they vary, do they do so based on path or extension in a predictable manner?

thinkharderdev · 2023-10-20T14:33:56Z

Object tagging as in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html

Can you provide any context on why you need to read tags?

We use tags to drive retention policies

Are the tags you wish to write static or do they vary based on request

There is a static set of tags but which tags get applied to any given object is dynamic

If they vary do they do so based on path or extension in a predictable manner?

No, it would not be possible to do this based on some static rules. It would have to be a mechanism that allows tagging of individual put requests.

I'm also a little hesitant to try and abstract this as there are a lot of subtle differences between APIs so it would be a little bit hard to make sure the default ObjectStore implementations work across providers. That said, adding a maximally flexible API interface at least allows custom implementations that can do whatever they want. So something as simple as what you proposed in the ticket might be ok even if the exact semantics are not consistent across different object storage APIs.

Alternatively, maybe we could punt on the whole issue by providing a canonical way to extend the ObjectStore interface. Something like (and just spitballing here :))

use std::any::Any;

pub trait ObjectStoreExt {
  // Only needed to allow downcasting to the concrete type
  fn as_any(&self) -> &dyn Any;
}

pub trait ObjectStore {
  type Ext: ObjectStoreExt;

  fn extension(&self) -> Self::Ext;
}

Then there could be standard extensions in the default impl:

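pub struct AwsObjectStoreExt { /* client handle elided */ }

impl AwsObjectStoreExt {
  // Bodies elided in this sketch
  pub async fn get_tags(&self, path: &Path) -> Result<HashMap<String, String>> { todo!() }

  pub async fn put_tags(&self, path: &Path, tags: &HashMap<String, String>) -> Result<()> { todo!() }
}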
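The downcasting idea can be sketched in a self-contained way as follows. This is only an illustration under stated assumptions: `AwsExt`, `NoExt`, `FakeS3`, `LocalFs` and `tags_for` are hypothetical names, tags live in an in-memory map instead of being fetched from S3, and `extension` returns `&dyn ObjectStoreExt` rather than an associated type so that the store can still be used as `dyn ObjectStore`.

```rust
use std::any::Any;
use std::collections::HashMap;

pub trait ObjectStoreExt {
    /// Allows callers to downcast to the concrete extension type.
    fn as_any(&self) -> &dyn Any;
}

pub trait ObjectStore {
    fn extension(&self) -> &dyn ObjectStoreExt;
}

/// Hypothetical AWS-specific extension holding tags per object path.
#[derive(Default)]
pub struct AwsExt {
    pub tags: HashMap<String, HashMap<String, String>>,
}

impl AwsExt {
    pub fn get_tags(&self, path: &str) -> Option<&HashMap<String, String>> {
        self.tags.get(path)
    }
}

impl ObjectStoreExt for AwsExt {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Extension for stores that expose nothing extra.
pub struct NoExt;

impl ObjectStoreExt for NoExt {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

#[derive(Default)]
pub struct FakeS3 {
    pub ext: AwsExt,
}

impl ObjectStore for FakeS3 {
    fn extension(&self) -> &dyn ObjectStoreExt {
        &self.ext
    }
}

pub struct LocalFs {
    pub ext: NoExt,
}

impl ObjectStore for LocalFs {
    fn extension(&self) -> &dyn ObjectStoreExt {
        &self.ext
    }
}

/// Returns tags only when the store exposes the AWS extension.
pub fn tags_for(store: &dyn ObjectStore, path: &str) -> Option<HashMap<String, String>> {
    let aws = store.extension().as_any().downcast_ref::<AwsExt>()?;
    aws.get_tags(path).cloned()
}
```

Code that only holds a `dyn ObjectStore` can call `tags_for` and receives `None` for stores without the AWS extension, which keeps provider-specific calls out of the shared trait.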

tustvold · 2023-10-20T14:46:59Z

I'm also a little hesitant to try and abstract this as there are a lot of subtle differences between APIs

Yeah, GCS doesn't even have a notion of tags, only metadata 😄

Alternatively, maybe we could punt on the whole issue by providing a canonical way to extend the ObjectStore interface.

I mean it isn't ideal but we do provide https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3.html#method.credentials and https://docs.rs/object_store/latest/object_store/aws/struct.AwsAuthorizer.html which would let you fairly easily construct your own requests, including https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectTagging.html

thinkharderdev · 2023-10-20T15:28:59Z

Yeah, GCS doesn't even have a notion of tags, only metadata

Right, but it may not really be an issue as long as the semantics are internally consistent within a provider. When it's unclear where to put the metadata (like in the case of AWS) that should be manageable through configuration.

It's annoying that semantics are different between providers but that is what it is. I think something like:

pub struct PutOptions {
    pub metadata: HashMap<String, String>
}

pub struct ObjectMeta {
    /// The full path to the object
    pub location: Path,
    /// The last modified time
    pub last_modified: DateTime<Utc>,
    /// The size in bytes of the object
    pub size: usize,
    /// The unique identifier for the object
    ///
    /// <https://datatracker.ietf.org/doc/html/rfc9110#name-etag>
    pub e_tag: Option<String>,
    /// A version indicator for this object
    pub version: Option<String>,
    /// Key/Value metadata for this object
    pub metadata: HashMap<String,String>
}

trait ObjectStore {

  async fn put_opt(&self, location: &Path, bytes: Bytes, options: PutOptions) -> Result<PutResult>;

  async fn get_metadata(&self, location: &Path) -> Result<HashMap<String, String>>;
}

where ObjectStore::get_metadata can be used to fetch metadata which isn't included in regular Get or List requests (like with S3).
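To make the trade-off concrete, here is a small sketch in which listing omits user metadata and a separate `get_metadata` lookup fills it in. `FakeStore` and `list_with_metadata` are hypothetical, everything is synchronous and in memory, and the "extra request" is only simulated by a second map lookup.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Debug, Clone, PartialEq)]
pub struct ObjectMeta {
    pub location: String,
    pub size: usize,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical in-memory store: path -> (bytes, user metadata).
#[derive(Default)]
pub struct FakeStore {
    pub objects: BTreeMap<String, (Vec<u8>, HashMap<String, String>)>,
}

impl FakeStore {
    /// Mimics a listing API that does not return user metadata.
    pub fn list(&self) -> Vec<ObjectMeta> {
        self.objects
            .iter()
            .map(|(location, (bytes, _))| ObjectMeta {
                location: location.clone(),
                size: bytes.len(),
                metadata: HashMap::new(),
            })
            .collect()
    }

    /// Stands in for the extra per-object request needed to read metadata.
    pub fn get_metadata(&self, location: &str) -> Option<HashMap<String, String>> {
        self.objects.get(location).map(|(_, md)| md.clone())
    }

    /// Fills in metadata with one additional lookup per listed object.
    pub fn list_with_metadata(&self) -> Vec<ObjectMeta> {
        self.list()
            .into_iter()
            .map(|mut meta| {
                meta.metadata = self.get_metadata(&meta.location).unwrap_or_default();
                meta
            })
            .collect()
    }
}
```

The `list_with_metadata` helper shows why always populating `ObjectMeta::metadata` is costly: against a real service it would issue one additional request per listed object.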

tustvold · 2023-10-20T15:34:02Z

retention policy

Are you referring to https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html or some custom system? I'm mainly interested in the importance of being able to read them, as writing has a lot more potential options for achieving it that don't leak into the ObjectStore trait.

Right, but it may not really be an issue as long as the semantics are internally consistent within a provider

Apart from this crate goes to great lengths to try to provide an API that is consistent across providers... 😅

thinkharderdev · 2023-10-20T15:49:44Z

Are you referring to https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html or some custom system?

Both. The data is in customer buckets and we add tags so they can manage their own retention. How they do that is up to them, we just provide the tags.

Currently we only need to write them. We can obviously work around that (and will in the immediate term) without involving the ObjectStore trait but it would be nice if we didn't have to as associating metadata with objects is used in a lot of applications.

Apart from this crate goes to great lengths to try to provide an API that is consistent across providers... 😅

Yeah, agreed but the APIs are what they are :). So we can either provide a consistent API which always works the same across providers by always doing additional API calls to grab metadata/tags (which seems like a bad idea). Or we can make the semantics around metadata depend on the provider.

Or of course we can do neither and just say that if we can't provide consistent semantics because of provider API differences then it's not going to be exposed in the ObjectStore interface. But IMO that ship has already sailed. We have ObjectStore::append even though S3 and GCS don't support append operations at all and on Azure you can only append to objects that were created as append blobs to begin with.

tustvold · 2023-10-20T15:54:37Z

We have ObjectStore::append even though S3 and GCS don't support append operations at all

This is not something we should be following, I fought very hard to not include that, and I am increasingly of the opinion we should remove it.

Or we can make the semantics around metadata depend on the provider.

Or a third option is to make these details specified at the point of creation of the ObjectStore, e.g. via some middleware system or otherwise. That way, if people have requirements outside the ObjectStore trait, they can plug in at that point.

thinkharderdev · 2023-10-20T16:12:30Z

That way if people have requirements outside the ObjectStore trait, they can plugin at that point

This would all be much easier if we didn't also have to deal with local filesystems :)

I'm leaning more and more towards some sort of extension mechanism. Either exposing the inner client so you can just make arbitrary API calls outside the ObjectStore interface or an extension type that can expose "extra" API operations.

tustvold · 2023-10-20T16:35:18Z

Adding a tags block to PutOptions that is simply ignored by backends that don't support it seems harmless to me.

I'm in the process of adding conditional put support and so will sequence this after that

tustvold · 2023-10-26T12:18:25Z

Turns out Azure doesn't even support this consistently... But then again Azure does seem to specialize in inconsistent APIs...

Specified feature is not yet supported for hierarchical namespace accounts

Edit: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-feature-support-in-storage-accounts

tustvold · 2023-10-26T12:49:43Z

Having played around with this, I'm unsure how to support it consistently: stores have different restrictions on what values are valid, and support across the stores is wildly inconsistent, even between stores from the same provider...

Taking a step back, could your use-case encode the lifecycle details in the path of the object instead?

thinkharderdev · 2023-10-26T15:48:05Z

Taking a step back, could your use-case encode the lifecycle details in the path of the object instead?

No, ultimately it's not up to us (this was a solution in place before us and would be monumentally complex to change).

Having played around with this I'm unsure how to support this consistently, stores have different restrictions on what value are valid, and support for this across the stores is wildly inconsistent, even stores from the same provider...

Why is this a problem? If a user adds incorrect metadata (values which are not allowed for whatever reason by the particular provider) then they get an error. It's no different than (for example) writing a multipart upload to S3, in which case each part needs to be at least 5 MiB (except for the last one). But the same limitation obviously wouldn't apply to local file systems. So at some level you have to know which provider you are using and what the individual semantics are.

tustvold · 2023-10-26T15:54:12Z

It's no different than (for example) writing a multipart upload to S3, in which case each part needs to be at least 5 MiB (except for the last one). But the same limitation obviously wouldn't apply to local file systems. So at some level you have to know which provider you are using and what the individual semantics are.

Because in general we try to hide these incompatibilities from you, you can't write to funky paths, the chunking for multipart upload is done for you, etc... We could add TagSets to the crate, and I have a mostly complete PR that does this, but it just seems strange to add something to the ObjectStore trait that is supported by only 1 and a half stores...

thinkharderdev · 2023-10-26T17:57:59Z

It's no different than (for example) writing a multipart upload to S3, in which case each part needs to be at least 5 MiB (except for the last one). But the same limitation obviously wouldn't apply to local file systems. So at some level you have to know which provider you are using and what the individual semantics are.

Because in general we try to hide these incompatibilities from you, you can't write to funky paths, the chunking for multipart upload is done for you, etc... We could add TagSets to the crate, and I have a mostly complete PR that does this, but it just seems strange to add something to the ObjectStore trait that is supported by only 1 and a half stores...

Right, and I think it's a good idea to try and hide the incompatibilities, but if the only way to do that is not add the functionality at all then it may be better to just expose the incompatibilities and let users deal with it. I guess the "proper" way to do this would be through traits. You could have the base ObjectStore trait expose the minimal API surface area that every provider can implement. And then have other traits for stuff not supported by all providers (ObjectAppend, ObjectMetadata, etc). This is a little awkward for upstream projects like DataFusion which tend to pass around Arc<dyn ObjectStore> but maybe this can be handled dynamically as well. So maybe something like

trait ObjectAppend: ObjectStore {
  async fn append(&self, location: &Path, bytes: Bytes) -> Result<()>;
}

trait ObjectStore {
   // ... regular methods

   fn as_append(&self) -> Option<Arc<dyn ObjectAppend>>;

   // or, once RPITIT (return-position impl Trait in traits) lands on stable:
   // fn as_append(&self) -> Option<&impl ObjectAppend>;
}
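A runnable sketch of this capability-trait pattern, with several simplifications that are assumptions rather than the crate's design: the methods are synchronous, take `&mut self`, and use `&str` and `&[u8]`; `AppendableStore`, `PlainStore` and `append_or_rewrite` are hypothetical names; and `as_append` returns a borrowed trait object instead of an `Arc`.

```rust
use std::collections::HashMap;

/// Minimal base trait every store implements (synchronous for brevity).
pub trait ObjectStore {
    fn put(&mut self, location: &str, bytes: &[u8]);
    fn get(&self, location: &str) -> Option<Vec<u8>>;

    /// Returns the append capability, if this store has one.
    fn as_append(&mut self) -> Option<&mut dyn ObjectAppend> {
        None
    }
}

/// Optional capability implemented only by stores that support append.
pub trait ObjectAppend: ObjectStore {
    fn append(&mut self, location: &str, bytes: &[u8]);
}

#[derive(Default)]
pub struct AppendableStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for AppendableStore {
    fn put(&mut self, location: &str, bytes: &[u8]) {
        self.objects.insert(location.to_string(), bytes.to_vec());
    }
    fn get(&self, location: &str) -> Option<Vec<u8>> {
        self.objects.get(location).cloned()
    }
    fn as_append(&mut self) -> Option<&mut dyn ObjectAppend> {
        Some(self)
    }
}

impl ObjectAppend for AppendableStore {
    fn append(&mut self, location: &str, bytes: &[u8]) {
        self.objects
            .entry(location.to_string())
            .or_default()
            .extend_from_slice(bytes);
    }
}

/// A store without native append; it keeps the default `as_append`.
#[derive(Default)]
pub struct PlainStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for PlainStore {
    fn put(&mut self, location: &str, bytes: &[u8]) {
        self.objects.insert(location.to_string(), bytes.to_vec());
    }
    fn get(&self, location: &str) -> Option<Vec<u8>> {
        self.objects.get(location).cloned()
    }
}

/// Appends natively when supported, otherwise rewrites the whole object.
/// Returns true if the native append path was taken.
pub fn append_or_rewrite(store: &mut dyn ObjectStore, location: &str, bytes: &[u8]) -> bool {
    if let Some(appender) = store.as_append() {
        appender.append(location, bytes);
        return true;
    }
    let mut data = store.get(location).unwrap_or_default();
    data.extend_from_slice(bytes);
    store.put(location, &data);
    false
}
```

Callers holding only a `dyn ObjectStore` can probe for the capability at runtime, which is the dynamic handling alluded to for projects that pass around `Arc<dyn ObjectStore>`.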

tustvold · 2023-10-26T18:16:50Z

Yeah, that's the approach we've taken for functionality that is disjoint, e.g. the MultiPartStore and Signer traits. This is a bit of a funny one because it is additive to existing functionality, which makes adding a separate trait a bit cumbersome, as you'll have to duplicate your write logic.

My current plan is to proceed with the approach in #4999. Provided we add a config option to ignore tags, I think we'll be fine, and will allow people to always write the tags and just have them ignored if not supported
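A rough sketch of how a "disable tagging" switch could let callers always attach tags. All types here are simplified stand-ins written for illustration, not the implementation in #4999: `StoreConfig` and `tagging_header` are hypothetical, and the encoding skips the URL escaping a real request would need.

```rust
/// Simplified stand-in for a set of object tags.
#[derive(Debug, Default, Clone)]
pub struct TagSet(Vec<(String, String)>);

impl TagSet {
    pub fn push(&mut self, key: &str, value: &str) {
        self.0.push((key.to_string(), value.to_string()));
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }

    /// Joins pairs as `k=v&k2=v2`. URL escaping is omitted in this sketch.
    pub fn encoded(&self) -> String {
        self.0
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join("&")
    }
}

#[derive(Debug, Default, Clone)]
pub struct PutOptions {
    pub tags: TagSet,
}

/// Hypothetical per-store configuration.
#[derive(Debug, Default, Clone)]
pub struct StoreConfig {
    pub disable_tagging: bool,
}

/// Returns the tagging header value to send, or None when tagging is
/// disabled for this store or no tags were supplied.
pub fn tagging_header(config: &StoreConfig, opts: &PutOptions) -> Option<String> {
    if config.disable_tagging || opts.tags.is_empty() {
        return None;
    }
    Some(opts.tags.encoded())
}
```

With this shape, application code builds the same `PutOptions` everywhere, and only the store configuration decides whether the tags reach the backend.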


tustvold · 2023-11-02T10:34:03Z

label_issue.py automatically added labels {'object-store'} from #4999

criccomini · 2024-06-18T00:02:47Z

Checking in here. I would like to refocus this ticket on User-Defined Metadata (not tags) as the title suggests. Much of the discussion is around object tags, which are a separate thing.

For User-Defined Metadata, I would like to implement a new Attribute called Metadata(String) that allows users to specify attributes in their put requests.

For get requests, I propose we expose the user-defined metadata the same way as other attributes, as part of the Attribute object. This could be somewhat confusing to users since there's a meta: ObjectMeta in GetResult. I am open to alternative suggestions, but my proposed approach mirrors the way other attributes behave.
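As an illustration of the proposal, the sketch below models a `Metadata(String)` attribute variant and maps it to and from S3-style headers. The `Attribute` enum and `Attributes` alias are simplified stand-ins rather than the crate's actual types, and `to_s3_headers` / `from_s3_headers` are hypothetical helpers; the only external fact relied on is that S3 carries user-defined metadata in headers prefixed with `x-amz-meta-`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an attribute key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Attribute {
    ContentType,
    /// User-defined metadata entry, identified by its key.
    Metadata(String),
}

/// Simplified attribute map: attribute -> value.
pub type Attributes = HashMap<Attribute, String>;

/// Converts attributes into header pairs for a put request, sorted by name.
pub fn to_s3_headers(attrs: &Attributes) -> Vec<(String, String)> {
    let mut headers: Vec<(String, String)> = attrs
        .iter()
        .map(|(attr, value)| {
            let name = match attr {
                Attribute::ContentType => "content-type".to_string(),
                Attribute::Metadata(key) => format!("x-amz-meta-{key}"),
            };
            (name, value.clone())
        })
        .collect();
    headers.sort();
    headers
}

/// Rebuilds attributes from response headers; unrelated headers are skipped.
pub fn from_s3_headers(headers: &[(String, String)]) -> Attributes {
    headers
        .iter()
        .filter_map(|(name, value)| {
            let attr = if name.as_str() == "content-type" {
                Attribute::ContentType
            } else {
                Attribute::Metadata(name.strip_prefix("x-amz-meta-")?.to_string())
            };
            Some((attr, value.clone()))
        })
        .collect()
}
```

Because the same map is used for both directions, a get request can hand back user-defined metadata next to attributes such as the content type.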

If no one objects, I would be happy to try and submit a patch for this. I talked to @Xuanwo about this briefly on Twitter and it sounds like no one is actively working on it.

criccomini · 2024-06-18T20:14:52Z

I've posted a PR for user-defined metadata here:

#5915

tustvold added the enhancement label Aug 30, 2023

tustvold mentioned this issue Aug 30, 2023

object-store: support for client-side encryption on S3 #4753

Open

tustvold mentioned this issue Sep 29, 2023

Conditional Put Support #4879

Closed

tustvold mentioned this issue Oct 24, 2023

Add ObjectStore::put_opts / Conditional Put (#4879) #4984

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 26, 2023

Object tagging (apache#4754)

c79dec0

tustvold mentioned this issue Oct 26, 2023

Object tagging (#4754) #4999

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 27, 2023

Object tagging (apache#4754)

631c2cb

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 27, 2023

Object tagging (apache#4754)

4a93259

tustvold closed this as completed in #4999 Oct 30, 2023

tustvold added a commit that referenced this issue Oct 30, 2023

Object tagging (#4754) (#4999)

11b2f5f

* Object tagging (#4754) * Allow disabling tagging * Rename to disable_tagging

tustvold added the object-store label Nov 2, 2023

tustvold reopened this Jan 24, 2024

This was referenced Jan 24, 2024

object_store: allow setting content-type per request #5329

Closed

Add Attributes API Exposing Broader Set of Object Metadata #5334

Open

tustvold mentioned this issue Apr 15, 2024

Add Attributes API (#5329) #5650

Merged

criccomini mentioned this issue Apr 25, 2024

Read SsTableInfo without loading entire SST slatedb/slatedb#16

Closed

This was referenced Jun 18, 2024

Support user-defined metadata in object_store slatedb/slatedb#70

Closed

Add user defined metadata #5915

Merged

tustvold closed this as completed in #5915 Jul 2, 2024
