feat(filemanager): ingest_id tagging and object move tracking #585
Merged
8 commits
dcd6af0
feat(filemanager): move_id column and get object tagging logic
mmalenic f87882b
feat(filemanager): add move_id logic to database when ingesting
mmalenic 15f778b
feat(filemanager): update attributes from existing records when inges…
mmalenic 3e26a49
refactor(filemanager): flatten nested if let conditions for better re…
mmalenic 01227cf
test(filemanager): move id tests
mmalenic 6560b5e
docs(filemanager): add explanation of object moving logic
mmalenic 3388fdc
refactor(filemanager): add correct ingest function permissions and re…
mmalenic 5cf5f71
docs(filemanager): doc improvements and clarifications
mmalenic
2 changes: 2 additions & 0 deletions
lib/workload/stateless/stacks/filemanager/database/migrations/0002_s3_ingest_id.sql
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
alter table s3_object add column ingest_id uuid; | ||
create index ingest_id_index on s3_object (ingest_id); |
58 changes: 58 additions & 0 deletions
lib/workload/stateless/stacks/filemanager/docs/MOVED_OBJECTS.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Tracking moved objects | ||
|
||
The filemanager tracks records from S3 events, which do not describe how objects move from one location to another. Using | ||
S3 events alone, it's not possible to tell whether an object that has been deleted in one place and created in another is | ||
the same object that has been moved, or two different objects. This is because S3 only tracks `Created` or `Deleted` | ||
events. | ||
|
||
To track moved objects, filemanager stores additional information in S3 tags. The tag gets updated when the object | ||
is moved. This allows the filemanager to track how objects move and also allows it to copy attributes when an object | ||
is moved/copied. This process is described [below](#tagging-process). | ||
|
||
## Tagging process | ||
|
||
The process of tagging is: | ||
|
||
* When an object record is ingested, the filemanager fetches the object's tags for each `Created` event. | ||
* If the tags can be retrieved, the filemanager looks for a key called `ingest_id`. The key name can be | ||
configured in the environment variable `FILEMANAGER_INGESTER_TAG_NAME`. | ||
* The tag is parsed as a UUID, and stored in the `ingest_id` column of `s3_object` for that record. | ||
* If the tag does not exist, then a new UUID is generated, and the object is tagged on S3 by calling `PutObjectTagging`. | ||
The new tag is also stored in the `ingest_id` column. | ||
* The database is also queried for any records with the same `ingest_id` so that attributes can be copied to the new record. | ||
|
||
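The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the filemanager's actual implementation: `resolve_ingest_id` and the `put_tags` callback (standing in for `PutObjectTagging`) are hypothetical names, and the sketch does not decide how a malformed tag should be handled, since the document does not say.

```python
import uuid

# Assumed default tag key; the doc says FILEMANAGER_INGESTER_TAG_NAME can change it.
TAG_NAME = "ingest_id"


def resolve_ingest_id(tags, put_tags):
    """Return the ingest_id for a newly created object.

    `tags` is the object's current tag set as a dict. `put_tags` is a
    hypothetical callback standing in for a PutObjectTagging call.
    """
    existing = tags.get(TAG_NAME)
    if existing is not None:
        # The tag exists: parse it as a UUID. A malformed value raises
        # ValueError here; how the real service reacts is not described.
        return uuid.UUID(existing)

    # No tag yet: generate a new UUID and tag the object. PutObjectTagging
    # replaces the whole tag set, so the existing tags are passed back too.
    new_id = uuid.uuid4()
    put_tags({**tags, TAG_NAME: str(new_id)})
    return new_id


# Usage: a dict plays the role of the object's tags on S3.
s3_tags = {"project": "example"}
first = resolve_ingest_id(dict(s3_tags), s3_tags.update)
# After a move that keeps the tags, the same ingest_id is recovered.
after_move = resolve_ingest_id(dict(s3_tags), s3_tags.update)
```

In this sketch `first` and `after_move` are equal, which is what lets the two records be linked.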
This logic is enabled by default, but it can be switched off using the `FILEMANAGER_INGESTER_TRACK_MOVES` environment variable. The filemanager | ||
API provides a way to query the database for records with a given `ingest_id`. | ||
|
||
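As a rough illustration of the lookup by `ingest_id`, the sketch below uses an in-memory SQLite database as a stand-in for the real database. Only the `ingest_id` column and `ingest_id_index` come from the migration in this pull request; the other columns (`bucket`, `object_key`, `attributes`), the `text` column types, and the `ingest` helper are assumptions made for the example.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema. Only ingest_id and its index mirror the migration.
conn.execute(
    "create table s3_object (bucket text, object_key text, attributes text, ingest_id text)"
)
conn.execute("create index ingest_id_index on s3_object (ingest_id)")


def ingest(bucket, object_key, ingest_id, attributes=None):
    """Insert a record, copying attributes from an earlier record with the same ingest_id."""
    if attributes is None and ingest_id is not None:
        row = conn.execute(
            "select attributes from s3_object "
            "where ingest_id = ? and attributes is not null "
            "order by rowid desc limit 1",
            (str(ingest_id),),
        ).fetchone()
        attributes = json.loads(row[0]) if row else None
    conn.execute(
        "insert into s3_object values (?, ?, ?, ?)",
        (
            bucket,
            object_key,
            json.dumps(attributes) if attributes is not None else None,
            str(ingest_id) if ingest_id is not None else None,
        ),
    )
    return attributes


move_id = uuid.uuid4()
ingest("bucket-a", "sample.bam", move_id, {"project": "example"})
# The moved object carries the same tag, so its record inherits the attributes.
copied = ingest("bucket-b", "archive/sample.bam", move_id)
linked = conn.execute(
    "select bucket, object_key from s3_object where ingest_id = ?", (str(move_id),)
).fetchall()
```

Here `linked` contains both locations of the object, which is the kind of result a query by `ingest_id` would be expected to return.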
## Design considerations | ||
|
||
Object tags on S3 are [limited][s3-tagging] to 10 tags per object, and each tag value can only store 256 Unicode characters. | ||
The filemanager avoids storing a large amount of data by using a UUID as the value of the tag, which is linked to object | ||
records that store attributes and data. | ||
|
||
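To make the size argument concrete, the following sketch checks a tag set against the S3 limits and shows that a UUID value fits comfortably. The helper name is hypothetical, and Python's `len` counts code points, which is only an approximation of how S3 measures tag length.

```python
import uuid

# S3 object tagging limits: 10 tags per object, 128-character keys, 256-character values.
MAX_TAGS = 10
MAX_KEY_LEN = 128
MAX_VALUE_LEN = 256


def fits_s3_tag_limits(tags):
    """Return True if the tag dict stays within the S3 object tagging limits."""
    return len(tags) <= MAX_TAGS and all(
        len(key) <= MAX_KEY_LEN and len(value) <= MAX_VALUE_LEN
        for key, value in tags.items()
    )


ingest_tag = {"ingest_id": str(uuid.uuid4())}
uuid_length = len(ingest_tag["ingest_id"])  # canonical UUID text form is 36 characters
ok = fits_s3_tag_limits(ingest_tag)

# Storing the attributes themselves in a tag would quickly exceed the limit.
too_large = fits_s3_tag_limits({"attributes": "x" * 1000})
```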
The object tagging process cannot be atomic, so there is a chance for concurrency errors to occur. Tagging can also | ||
fail due to database or network errors. The filemanager only tracks `ingest_id`s if it knows that a tag has been | ||
successfully created on S3, and successfully stored in the database. If tagging fails, or it's not enabled, then the `ingest_id` | ||
column will be null. | ||
|
||
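The failure behaviour described above might look something like this sketch. It only covers a failed tagging call; `record_ingest_id`, `put_tag` and `store_record` are hypothetical stand-ins for the S3 and database operations, and the broad `except` is a simplification.

```python
import uuid


def record_ingest_id(put_tag, store_record):
    """Store an ingest_id only if the tag was confirmed on S3, otherwise store null."""
    candidate = uuid.uuid4()
    try:
        put_tag(str(candidate))
    except Exception:
        # The tag could not be confirmed on S3, so the column is left null.
        candidate = None
    store_record(candidate)
    return candidate


def failing_put_tag(value):
    raise ConnectionError("simulated network failure")


records = []
record_ingest_id(failing_put_tag, records.append)       # tagging fails: null ingest_id
record_ingest_id(lambda value: None, records.append)    # tagging succeeds: UUID stored
```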
The act of tagging the object allows it to be tracked, so ideally this is done as soon as possible. Currently, this happens | ||
at ingestion; however, this could have performance implications. Alternative approaches should consider asynchronous tagging. | ||
For example, [`s3:ObjectTagging:*`][s3-tagging-event] events could be used for this purpose. | ||
|
||
The object tagging mechanism also doesn't differentiate between moved objects and copied objects with the same tags. | ||
If an object is copied with tags, the `ingest_id` will also be copied and the above logic will apply. | ||
|
||
## Alternative designs | ||
Nice doc! I'd probably add another small note on the checksum approach: it can't be used if the checksums are not expected to be the same, e.g. with compression, which is a big use case for us.
||
|
||
Alternatively, S3 object metadata could also be used to track moves using a similar mechanism. However, metadata can | ||
[only be updated][s3-metadata] by deleting and recreating the object. This process would be much more costly so tagging | ||
was chosen instead. | ||
|
||
Another approach is to compare object checksums or etags. However, this would also be limited if the checksum is not present, | ||
or if the etag was computed using a different part-size. It also fails if the checksum is not expected to be the same, | ||
for example, if tracking a set of files that are compressed. Both these approaches could be used in addition to object tagging | ||
to provide more ways to track moves. | ||
|
||
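The part-size limitation can be illustrated with the commonly observed multipart ETag scheme: the MD5 of the concatenated per-part MD5 digests, followed by the part count. The sketch below assumes that scheme, which does not hold for every encryption setting, and uses tiny part sizes purely for illustration (real multipart parts are much larger).

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute a multipart-style ETag for `data` split into `part_size` chunks."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


payload = b"x" * (10 * 1024)
etag_a = multipart_etag(payload, 4 * 1024)  # three parts
etag_b = multipart_etag(payload, 5 * 1024)  # two parts
same_object_different_etag = etag_a != etag_b
```

The same bytes produce different ETags under the two part sizes, so an ETag comparison alone would treat a moved object as a different one.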
[s3-tagging]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html | ||
[s3-tagging-event]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types | ||
[s3-metadata]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html |
Tagging is currently part of the ingestion process, right?
So there's a possibility that this may slow down the ingestion and may become an issue under heavy load?
Not for now, but if that should become the case, we could think of an async tagging strategy.
Given the option to disable tagging (or tagging failing/missing for other reasons), it would be great to think of an async tagging option.
Yeah I agree with this, it could potentially slow things down as it's done on ingestion. It's a bit tricky because the act of tagging the object conveys the information of the move - ideally this would be done as soon as possible (i.e. on ingestion). Anything async would extend the window that the object isn't tagged, meaning that the move can't be tracked. In practice this probably wouldn't make a difference if the object isn't moved as soon as it's created.
There are `s3:ObjectTagging:*` events which might be good for this that I'll look into.
This is always a challenging tradeoff. Let's give it a shot!
Yes, tricky, but that's a general "issue" with event-based systems: there's an inevitable delay/asynchronicity.
And I am not saying we should implement that now. An open ticket or comment in the code to keep track of it is perfectly fine.
To compensate for potential concurrency issues, the mentioned support of checksums, name matches, etc. could be used... at least to some extent. All future considerations... all good for now!