From d2f312308673d704eea16e9b9adaccd99b6657f6 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Mon, 4 Nov 2024 11:44:56 -0500 Subject: [PATCH 1/6] Updated design doc for Upload/AssetBlob garbage collection --- ...rbage-collection-uploads-asset-blobs-1.md} | 0 ...arbage-collection-uploads-asset-blobs-2.md | 43 +++++++++++++++++++ 2 files changed, 43 insertions(+) rename doc/design/{garbage-collection-uploads-asset-blobs.md => garbage-collection-uploads-asset-blobs-1.md} (100%) create mode 100644 doc/design/garbage-collection-uploads-asset-blobs-2.md diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs-1.md similarity index 100% rename from doc/design/garbage-collection-uploads-asset-blobs.md rename to doc/design/garbage-collection-uploads-asset-blobs-1.md diff --git a/doc/design/garbage-collection-uploads-asset-blobs-2.md b/doc/design/garbage-collection-uploads-asset-blobs-2.md new file mode 100644 index 000000000..b2d2b71ae --- /dev/null +++ b/doc/design/garbage-collection-uploads-asset-blobs-2.md @@ -0,0 +1,43 @@ +# Upload and Asset Blob Garbage Collection + +This document presents a design for garbage collection of uploads and asset blobs in the context of the S3 trailing delete feature. It explains the need for garbage collection and describes the scenarios of orphaned uploads and orphaned asset blobs. The implementation involves introducing a new daily task to query and delete uploads and asset blobs that meet certain criteria. The document also mentions the recoverability of uploaded data and provides a GitHub branch for the implementation. + +## Background + +Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed, we are ready to implement garbage collection. 
[This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned "Assets" (i.e. files that have been properly uploaded, have metadata, etc. but are not associated with any Dandisets) is more complex and is left for a future design document. + +## Why do we need garbage collection? + +When a user creates an asset, they send a request to the API and the API returns a series of presigned URLs for the user to perform a multipart upload to. Then, an `Upload` database row is created to track the status of the upload. When the user is done uploading their data to the presigned URLs, they must "finalize" the upload by sending a request to the API to create an `AssetBlob` out of that `Upload`. Finally, they must make one more request to actually associate this new `AssetBlob` with an `Asset`. + +### Orphaned Uploads + +If the user cancels a multipart upload partway through, or completes the multipart upload to S3 but does not "finalize" the upload, then the upload becomes "orphaned", i.e. the associated `Upload` record and S3 object remain in the database/bucket indefinitely. + +### Orphaned AssetBlobs + +In this case, assume that the user properly completes the multipart upload flow and "finalizes" the `Upload` record such that it is now an `AssetBlob`, but they do not send a request to associate the new blob with an `Asset`. That `AssetBlob` record and associated S3 object will remain in the database/bucket indefinitely. + +## Implementation Details + +We will introduce a new celery-beat task that runs daily. 
This task will + +- Query for and delete any uploads that are older than the multipart upload presigned URL expiration time (this is currently 7 days). +- Query for and delete any AssetBlobs that are (1) not associated with any Assets, and (2) older than 7 days. + +Due to the trailing delete lifecycle rule, the actual uploaded data will remain recoverable for up to 30 days after this deletion, after which the lifecycle rule will clear it out of the bucket permanently. + +In order to facilitate restoration of deleted data, as well as for general auditability of the garbage collection feature, a new database table will be created to store information on garbage-collection events. Rows in this new table will be garbage-collected themselves every 30 days, since that is the number of days that the trailing delete feature waits before deleting expired object versions. In other words, once the blob is no longer recoverable via trailing delete in S3, the corresponding `GarbageCollectionEvent` should be deleted as well. + +### Garbage collection event table + +```python +from django.db import models + +class GarbageCollectionEvent(models.Model): + type = models.CharField(max_length=255) + created = models.DateTimeField(auto_now_add=True) + records = models.JSONField() +``` + +Note: the `records` field is a JSON serialization of the garbage-collected QuerySet, generated using [Django’s JSON model serializer](https://docs.djangoproject.com/en/5.1/topics/serialization/#serialization-formats-json)). This gives us the minimum information needed to restore a blob. The idea is that this can be reused for garbage collection of `Assets` as well. 
From 199cb2f7ccf7e1e1da28e5d3db9034b858459dfb Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh <37340715+mvandenburgh@users.noreply.github.com> Date: Wed, 20 Nov 2024 09:38:02 -0500 Subject: [PATCH 2/6] Update doc/design/garbage-collection-uploads-asset-blobs-2.md Co-authored-by: Yaroslav Halchenko --- doc/design/garbage-collection-uploads-asset-blobs-2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/garbage-collection-uploads-asset-blobs-2.md b/doc/design/garbage-collection-uploads-asset-blobs-2.md index b2d2b71ae..709d98e9f 100644 --- a/doc/design/garbage-collection-uploads-asset-blobs-2.md +++ b/doc/design/garbage-collection-uploads-asset-blobs-2.md @@ -4,7 +4,7 @@ This document presents a design for garbage collection of uploads and asset blob ## Background -Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned "Assets" (i.e. files that have been properly uploaded, have metadata, etc. but are not associated with any Dandisets) is more complex and is left for a future design document. +Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed, we are ready to implement the next stages of garbage collection. 
[This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of `Upload`s and `AssetBlob`s, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned `Asset`s (i.e. files that have been properly uploaded, have metadata, etc. but are not associated with any Dandisets) is more complex and is left for a future design document. ## Why do we need garbage collection? From 959e9b4c185f86bdd914459979069882794f2cf2 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Wed, 20 Nov 2024 09:42:05 -0500 Subject: [PATCH 3/6] Account for Asset GC in asset blob explanation --- doc/design/garbage-collection-uploads-asset-blobs-2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/garbage-collection-uploads-asset-blobs-2.md b/doc/design/garbage-collection-uploads-asset-blobs-2.md index 709d98e9f..58b9d2f5f 100644 --- a/doc/design/garbage-collection-uploads-asset-blobs-2.md +++ b/doc/design/garbage-collection-uploads-asset-blobs-2.md @@ -16,7 +16,7 @@ If the user cancels a multipart upload partway through, or completes the multipa ### Orphaned AssetBlobs -In this case, assume that the user properly completes the multipart upload flow and "finalizes" the `Upload` record such that it is now an `AssetBlob`, but they do not send a request to associate the new blob with an `Asset`. That `AssetBlob` record and associated S3 object will remain in the database/bucket indefinitely. 
+For this case there are two scenarios - (1) the user properly completes the multipart upload flow and "finalizes" the `Upload` record such that it is now an `AssetBlob`, but they do not send a request to associate the new blob with an `Asset`, or (2) an `Asset` is garbage collected (yet to be implemented), leaving its corresponding `AssetBlob` "orphaned". In both cases, the `AssetBlob` record and associated S3 object will remain in the database/bucket indefinitely. ## Implementation Details From f587c4235e9a41374433b1e05750fdd8a5488f72 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Mon, 2 Dec 2024 10:58:51 -0500 Subject: [PATCH 4/6] Add link to implementation PR --- doc/design/garbage-collection-uploads-asset-blobs-2.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/doc/design/garbage-collection-uploads-asset-blobs-2.md b/doc/design/garbage-collection-uploads-asset-blobs-2.md index 58b9d2f5f..c878e07f0 100644 --- a/doc/design/garbage-collection-uploads-asset-blobs-2.md +++ b/doc/design/garbage-collection-uploads-asset-blobs-2.md @@ -1,6 +1,6 @@ # Upload and Asset Blob Garbage Collection -This document presents a design for garbage collection of uploads and asset blobs in the context of the S3 trailing delete feature. It explains the need for garbage collection and describes the scenarios of orphaned uploads and orphaned asset blobs. The implementation involves introducing a new daily task to query and delete uploads and asset blobs that meet certain criteria. The document also mentions the recoverability of uploaded data and provides a GitHub branch for the implementation. +This document presents a design for garbage collection of uploads and asset blobs in the context of the S3 trailing delete feature. It explains the need for garbage collection and describes the scenarios of orphaned uploads and orphaned asset blobs. 
The implementation involves introducing a new daily task to query and delete uploads and asset blobs that meet certain criteria. The document also mentions the recoverability of uploaded data and provides a link to a GitHub PR providing the implementation. ## Background @@ -41,3 +41,6 @@ class GarbageCollectionEvent(models.Model): ``` Note: the `records` field is a JSON serialization of the garbage-collected QuerySet, generated using [Django’s JSON model serializer](https://docs.djangoproject.com/en/5.1/topics/serialization/#serialization-formats-json)). This gives us the minimum information needed to restore a blob. The idea is that this can be reused for garbage collection of `Assets` as well. + +## Implementation +See [PR #2087](https://github.com/dandi/dandi-archive/pull/2087) for the implementation. From 056a9a164907e8dc38b199412f912c36a48cbc38 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Mon, 6 Jan 2025 10:34:40 -0500 Subject: [PATCH 5/6] Update DB models to new design --- doc/design/garbage-collection-uploads-asset-blobs-2.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/doc/design/garbage-collection-uploads-asset-blobs-2.md b/doc/design/garbage-collection-uploads-asset-blobs-2.md index c878e07f0..82bcd806c 100644 --- a/doc/design/garbage-collection-uploads-asset-blobs-2.md +++ b/doc/design/garbage-collection-uploads-asset-blobs-2.md @@ -27,7 +27,7 @@ We will introduce a new celery-beat task that runs daily. This task will Due to the trailing delete lifecycle rule, the actual uploaded data will remain recoverable for up to 30 days after this deletion, after which the lifecycle rule will clear it out of the bucket permanently. -In order to facilitate restoration of deleted data, as well as for general auditability of the garbage collection feature, a new database table will be created to store information on garbage-collection events. 
Rows in this new table will be garbage-collected themselves every 30 days, since that is the number of days that the trailing delete feature waits before deleting expired object versions. In other words, once the blob is no longer recoverable via trailing delete in S3, the corresponding `GarbageCollectionEvent` should be deleted as well. +In order to facilitate restoration of deleted data, as well as for general auditability of the garbage collection feature, two new database tables will be created to store information on garbage-collection events. Rows in these new tables will themselves be garbage-collected after 30 days, since that is the number of days that the trailing delete feature waits before deleting expired object versions. In other words, once the blob is no longer recoverable via trailing delete in S3, the corresponding `GarbageCollectionEvent` and its associated `GarbageCollectionEventRecord`s should be deleted as well. ### Garbage collection event table @@ -35,9 +35,12 @@ In order to facilitate restoration of deleted data, as well as for general audit from django.db import models class GarbageCollectionEvent(models.Model): + timestamp = models.DateTimeField(auto_now_add=True) type = models.CharField(max_length=255) - created = models.DateTimeField(auto_now_add=True) - records = models.JSONField() + +class GarbageCollectionEventRecord(models.Model): + event = models.ForeignKey(GarbageCollectionEvent, on_delete=models.CASCADE) + record = models.JSONField() ``` Note: the `records` field is a JSON serialization of the garbage-collected QuerySet, generated using [Django’s JSON model serializer](https://docs.djangoproject.com/en/5.1/topics/serialization/#serialization-formats-json)). This gives us the minimum information needed to restore a blob. The idea is that this can be reused for garbage collection of `Assets` as well. 
From 466dd7dbc76180e1234483b4d4dda2f2aa2845b4 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Mon, 6 Jan 2025 10:35:56 -0500 Subject: [PATCH 6/6] Typo --- doc/design/garbage-collection-uploads-asset-blobs-2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/garbage-collection-uploads-asset-blobs-2.md b/doc/design/garbage-collection-uploads-asset-blobs-2.md index 82bcd806c..e853bf3a4 100644 --- a/doc/design/garbage-collection-uploads-asset-blobs-2.md +++ b/doc/design/garbage-collection-uploads-asset-blobs-2.md @@ -43,7 +43,7 @@ class GarbageCollectionEventRecord(models.Model): record = models.JSONField() ``` -Note: the `records` field is a JSON serialization of the garbage-collected QuerySet, generated using [Django’s JSON model serializer](https://docs.djangoproject.com/en/5.1/topics/serialization/#serialization-formats-json)). This gives us the minimum information needed to restore a blob. The idea is that this can be reused for garbage collection of `Assets` as well. +Note: the `record` field is a JSON serialization of the garbage-collected QuerySet, generated using [Django’s JSON model serializer](https://docs.djangoproject.com/en/5.1/topics/serialization/#serialization-formats-json). This gives us the minimum information needed to restore a blob. The idea is that this can be reused for garbage collection of `Asset`s as well. ## Implementation See [PR #2087](https://github.com/dandi/dandi-archive/pull/2087) for the implementation.
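
The age-based criteria described under "Implementation Details" can be sketched in plain Python as follows. This is only an illustration of the selection rules, not the code in PR #2087: the dataclasses `UploadRow`, `AssetBlobRow`, and `EventRow`, the field `asset_count`, and the helper function names are all hypothetical stand-ins. In the actual Django application these checks would presumably be expressed as queryset filters inside the daily celery-beat task.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Retention windows taken from the design text.
UPLOAD_EXPIRATION = timedelta(days=7)      # multipart presigned URL expiration
ASSET_BLOB_EXPIRATION = timedelta(days=7)  # minimum age of an orphaned blob
EVENT_RETENTION = timedelta(days=30)       # S3 trailing delete window


@dataclass
class UploadRow:
    """Hypothetical stand-in for an `Upload` database row."""
    created: datetime


@dataclass
class AssetBlobRow:
    """Hypothetical stand-in for an `AssetBlob` database row."""
    created: datetime
    asset_count: int  # number of `Asset`s referencing this blob (assumed field)


@dataclass
class EventRow:
    """Hypothetical stand-in for a `GarbageCollectionEvent` row."""
    timestamp: datetime


def upload_is_garbage(upload: UploadRow, now: datetime) -> bool:
    # An upload older than the presigned URL expiration can no longer be completed.
    return now - upload.created > UPLOAD_EXPIRATION


def asset_blob_is_garbage(blob: AssetBlobRow, now: datetime) -> bool:
    # A blob is collectable only if it is both orphaned and old enough.
    return blob.asset_count == 0 and now - blob.created > ASSET_BLOB_EXPIRATION


def event_is_expired(event: EventRow, now: datetime) -> bool:
    # Once trailing delete has purged the data, the audit row is no longer useful.
    return now - event.timestamp > EVENT_RETENTION


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale_upload = UploadRow(created=now - timedelta(days=8))
    print(upload_is_garbage(stale_upload, now))
```

Passing `now` explicitly, rather than reading the clock inside each helper, keeps all three checks consistent within one task run and makes them easy to test.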