Restrict media privacy until a referencing page is published #46

ababic · 2024-11-29T13:52:07Z

What is the context of this PR?

This is a follow-up to #42 that focusses on addressing solely public/private status of media, without attempting to tie objects to pages to aid with permission policy customisation.

It's fully functional and has lots of tests (which are worth checking out to get a feel for the intended behaviour).

A brief summary:

All media should be 'private' by default (regardless of upload route)
Media is only made public when used within the content field of a page, and the revision featuring the media is published.
If the referencing page is unpublished, any media referenced solely by that page is made private again (anything referenced by other 'live' pages remains public)

When a media item's privacy changes:

Where the storage backend supports it (S3), attempts are made to set file-level permissions to reflect the new privacy value.
Where permissions for all a media item's file are updated successfully, the file_permissions_last_set timestamp is updated for the media item, and the object's file_permissions_are_outdated() method will return False.
If any errors occurred during the update process, file_permissions_last_set is not updated, and the file_permissions_are_outdated() method will return True.
Where front-end cache invalidation is configured, purge requests will also be triggered for all relevant 'serve' or direct file URLs

How to review

Testing media privacy for a draft page

In Wagtail, create a new Information Page
Enter "Media upload test" for the title and "Test" for the summary
Add an "Image" block to the content, and use the chooser modal to upload a new image
Add a "Documents" block to the content, use the sub-block options to add a document, and use the chooser modal for that item to upload a new document
Open the preview pane for the page and inspect the HTML where the blocks are rendered.
The src value of the image rendition should look something like /images/{secure key}/2/original/{image filename}, which means it's being served by the 'serve' view.
The href value of the document link should look something like /documents/
Open the document link URL and image src URL in new tabs (they should still work).
Sign out of Wagtail.
Refresh the the tabs opened in step 9 - You should now get a "403 Forbidden" response for both.
In a new tab, Sign back in to Wagtail.
Refresh the tabs once more, and they should work again

Testing media privacy after a page has been published

Publish the "Media upload test" page created in the above steps
In the green success message on the next page, click the "View live" button to view the live/public-facing version of the page.
Inspect the HTML for the image. The src value should have changed to something like: media/images/{image filename}.fill-446x390.format-webp-.original.png.
Access Wagtail again edit the "Media upload test" page once more.
Open the preview pane and inspect the HTML for the image there too. The src value should look the same as it did on the live version.
At the bottom of the page, click the arrow next to the "Save draft" button to view the additional options, and click "Unpublish page".
On the confirmation screen, click "Yes, unpublish it" to confirm the action.
Edit the page again in Wagtail and look again at the preview pane.
Inspect the HTML for the image once more. The src value should now again look like: /images/{secure key}/2/original/{image filename}.

Testing image privacy once a draft page has been published

In Wagtail, edit the draft "Image upload test" page again, and publish it.
In the green success message on the next page, click the "View live" button to view the live/public-facing version of the page.
Inspect the HTML for the image. The src value should look something like: media/images/{image filename}.fill-446x390.format-webp-.original.png.
Access Wagtail again edit the "Image upload test" page once more.
Open the preview pane and inspect the HTML for the image there too. The src value should look the same as it did on the live version.
At the bottom of the page, click the arrow next to the "Save draft" button to view the additional options, and click "Unpublish page".
On the confirmation screen, click "Yes, unpublish it" to confirm the action.
Edit the page again in Wagtail and look again at the preview pane.
Inspect the HTML for the image once more. The src value should now again look like: /images/{secure key}/2/original/{image filename}.

Follow-up Actions

Update documentation
Ensure PRIVATE_MEDIA_PERMISSION_SETTING_MAX_WORKERS is set appropriately for all target environments
Ensure python manage.py retry_file_permission_set_attempts is set to run on a cron every 10 minutes in all target environments

nehakerung

LGTM - The doc was simple to understand and follow. I also tested as how to review suggested.

Everything was as said but for Testing media privacy after a page has been published,
media/images/{image filename}.fill-446x390.format-webp-.original.png showed as media/images/{image filename}.original.png instead. I presume this is because I didn't set any fill/format for my images so I don't believe this to be important.

Just requesting an update on hyperlink :D

docs/custom-features/media_privacy.md

MebinAbraham

Nice work.

I'm happy for it to be separate, but I suggest we disable the Media library from Admin. It will add an unnecessary layer of complexity.

MebinAbraham · 2024-12-10T11:34:55Z

cms/private_media/bulk_operations.py

+            results[file] = handler(file)
+
+    with concurrent.futures.ThreadPoolExecutor(
+        max_workers=int(settings.PRIVATE_MEDIA_PERMISSION_SETTING_MAX_WORKERS)


Minor, could be something shorter like: MEDIA_PERMISSION_MAX_WORKERS

I'd like to stick with the PRIVATE_MEDIA_ prefix, as it makes it easier to mentally map settings to apps in the project. But, I do find the current name a little confusing, so have changed to PRIVATE_MEDIA_BULK_UPDATE_MAX_WORKERS in ce09c50.

Whilst it's still a bit lengthy, I think clarity is more important when it comes to settings.

MebinAbraham · 2024-12-10T11:44:03Z

cms/private_media/bulk_operations.py

+        if intended_privacy is Privacy.PRIVATE:
+            handler = getattr(storage, "make_private", None)
+        elif intended_privacy is Privacy.PUBLIC:
+            handler = getattr(storage, "make_public", None)
+
+        if handler is None:


For private media, I would like to enforce the fact that the storage backend must have these handlers. For local dev, the handles can "log out the action". I would rather not have it pass silently if storage doesn't support it; I would like to take the approach of the storage backend must implement it.

Funnily enough, that's how I had it originally, but I changed my mind at some point. Should be very easy to restore.

Resolved in 16a05fb

MebinAbraham · 2024-12-10T12:01:47Z

cms/private_media/management/commands/retry_file_permission_set_attempts.py

+        )
+
+    def handle(self, *args: Any, **options: Any) -> None:
+        self.dry_run = options["dry_run"]  # pylint: disable=attribute-defined-outside-init


Minor, add as class attr like the publish_bundles.py command does instead of the pylint disable?. The key point is consistency.

This has been resolved

MebinAbraham · 2024-12-10T12:05:26Z

cms/private_media/management/commands/retry_file_permission_set_attempts.py

+        for model in get_private_media_models():
+            queryset = model.objects.filter(file_permissions_last_set__lt=F("privacy_last_changed"))
+            if not self.dry_run and issubclass(model, PrivateImageMixin):
+                queryset = queryset.prefetch_related("renditions")
+            permissions_outdated = list(queryset)
+            self.stdout.write(f"{len(permissions_outdated)} {model.__name__} instances have outdated file permissions.")
+            if permissions_outdated:
+                make_private = []
+                make_public = []
+                for obj in permissions_outdated:
+                    if obj.privacy is Privacy.PRIVATE:
+                        make_private.append(obj)
+                    elif obj.privacy is Privacy.PUBLIC:
+                        make_public.append(obj)
+
+                self.update_file_permissions(model, make_private, Privacy.PRIVATE)
+                self.update_file_permissions(model, make_public, Privacy.PUBLIC)


Suggestion: I think it is easier to read if we added some newlines and exit early to reduce indentation. It feels a bit crowded.

for model in get_private_media_models(): queryset = model.objects.filter(file_permissions_last_set__lt=F("privacy_last_changed")) if not self.dry_run and issubclass(model, PrivateImageMixin): queryset = queryset.prefetch_related("renditions") permissions_outdated = list(queryset) self.stdout.write(f"{len(permissions_outdated)} {model.__name__} instances have outdated file permissions.") if not permissions_outdated: return None make_private = [] make_public = [] for obj in permissions_outdated: if obj.privacy is Privacy.PRIVATE: make_private.append(obj) elif obj.privacy is Privacy.PUBLIC: make_public.append(obj) self.update_file_permissions(model, make_private, Privacy.PRIVATE) self.update_file_permissions(model, make_public, Privacy.PUBLIC)

Funnily enough, that's how I had it originally, but it was before I broke handling off to update_file_permissions, and pylint had a problem with the number of return statements. It should be okay to do that again now that update_file_permissions exists

Resolved in 7f0de81

MebinAbraham · 2024-12-10T12:08:41Z

cms/private_media/management/commands/retry_file_permission_set_attempts.py

+                self.update_file_permissions(model, make_public, Privacy.PUBLIC)
+
+    def update_file_permissions(
+        self, model_class: type["PrivateMediaMixin"], items: list["PrivateMediaMixin"], privacy: Privacy


Type annotation for args should be abstract unless the method needs it to be a list.

I definitely agree where it comes to reusable code, but in the case of management commands (which are self-contained, and only really reusable if you're defining a base class for other commands to inherit), I think being explicit adds value. Less... 'how you might use me' and more 'this is exactly how i'm being used'.

Updated in 5198943

MebinAbraham · 2024-12-11T13:50:26Z

cms/private_media/signal_handlers.py

+            .values_list("int_object_id", flat=True)
+            .distinct()
+        )
+        queryset = model_class.objects.filter(pk__in=referenced_pks, _privacy=Privacy.PRIVATE)


If referenced_pks is empty, continue to the next model instead of eventually coming out of the loop?.

The issue there is that checking the result at this point will evaluate the query (a separate database round trip), currently, all of the subqueries remain lazy until the final queryset is evaluated, and postgres handles it in one round-trip).

I suggest avoid this change based on the above reasoning. @MebinAbraham are you happy with that?

MebinAbraham · 2024-12-11T13:56:05Z

cms/private_media/signal_handlers.py

+        # Identify references from other pages and extract their IDs, so that we can
+        # figure out which of those pages is live
+        referencing_page_ids = (
+            ReferenceIndex.objects.filter(to_content_type=model_ct, to_object_id__in=referenced_pks)
+            .exclude(pk__in=references.values_list("pk", flat=True))
+            .filter(base_content_type=page_ct)
+            .annotate(page_id=Cast("object_id", output_field=IntegerField()))
+            .values_list("page_id", flat=True)
+            .distinct()
+        )
+
+        # Out of the pages that are referencing the media, identify the ids of those
+        # that are currently live
+        live_page_ids = (
+            Page.objects.filter(pk__in=referencing_page_ids)
+            .live()
+            .annotate(str_id=Cast("id", output_field=CharField()))
+            .values_list("str_id", flat=True)
+        )
+
+        # Now we can identify references from live pages only, and
+        # generate a list of media item ids that should not be made private
+        live_page_referenced_media_pks = (
+            ReferenceIndex.objects.filter(to_content_type=model_ct, to_object_id__in=referenced_pks)
+            .filter(base_content_type=page_ct, object_id__in=live_page_ids)
+            .annotate(int_id=Cast("to_object_id", output_field=IntegerField()))
+            .values_list("int_id", flat=True)
+            .distinct()
+        )
+
+        queryset = (
+            model_class.objects.all()
+            .filter(pk__in=[int(pk) for pk in referenced_pks], _privacy=Privacy.PUBLIC)
+            .exclude(pk__in=live_page_referenced_media_pks)
+        )


I'm not sure if they will be very performant in the long term. One to revisit, as this might be quite large in the future. Unless there is a user need, it might be better to treat images as unique, so accept additional storage, meaning we can replace this involved check with a more simple one; worst case, it should only allow reusability within the same model, not across models.

This has now been refactored so that the potential list of pages only includes those that reference the same media referenced by the current object. If, like you say, that case is rare, it should be very small indeed.

Its always better for important code like this to assume as little as possible about what other things might be doing, so even if we did manage to somehow restrict media use to a single object (I currently have no idea how that could be achieved nicely), I'd still very much recommend keeping this.

Where the queryset result isn't explicitly evaluated to a list/set beforehand, Django usually keeps everything lazy and has the database evaluate everything in one go, so this is really quite performant (there should be some assertNumQueries items in the tests to confirm this). So, it isn't that bad at all performance-wise.

@MebinAbraham Happy to mark this as resolved based on the above explaination?

MebinAbraham · 2024-12-11T21:38:36Z

cms/private_media/views.py

+        if image.is_public and not image.file_permissions_are_outdated():
+            return redirect(rendition.file.url)
+
+        # Block access to private images if the user has insufficient permissions
+        user = self.request.user
+        if image.is_private and (
+            not user.is_authenticated
+            or not permission_policy.user_has_any_permission_for_instance(user, ["choose", "add", "change"], image)
+        ):
+            raise PermissionDenied


We need to ensure that the file can still be served if the image is private but has public pages referencing it. For example, if a page is published but the ACLs are not properly set due to network issues etc, the application should still serve the image until the next scheduled task attempts to fix and retry the ACL settings. We cannot have a page that features broken assets. The same principle applies to documents.

This is covered by signal handlers. If a page features any media that is private, it is made public when those changes are published, and whilst the page remains published, media referenced by it will always remain public.

Also, the _privacy field should always be the canonical indicator of privacy for a media item. If we're having to make additional queries at the time of serving to figure out if that is really the case, then we've failed terribly.

if a page is published but the ACLs are not properly set due to network issues etc, the application should still serve the image until the next scheduled task attempts to fix and retry the ACL settings.

This particular case is handled by the last line of the method. The image is 'public' according to the system, so won't be permission-checked. It'll just be served, using the server's AWS creds to stream the file contents.

MebinAbraham · 2024-12-11T21:40:02Z

cms/private_media/wagtail_hooks.py

+def protect_private_documents(document: "PrivateDocumentMixin", request: "HttpRequest") -> None:
+    """Block access to private documents if the user has insufficient permissions."""
+    if document.is_private and (
+        not request.user.is_authenticated
+        or not permission_policy.user_has_any_permission_for_instance(
+            request.user, ["choose", "add", "change"], document
+        )
+    ):


As above, we need to be able to serve even if it is private just incase ACL set failed.

Documents are different to images in that they are ALWAYS served by the view, regardless of privacy values. So, in essence, the ACL is irrelevant here, as the storage backend uses its credentials to stream the file from AWS

MebinAbraham · 2024-12-11T21:43:44Z

cms/private_media/storages.py

+logger = logging.getLogger(__name__)
+
+
+class PrivacySettingS3Storage(S3Storage):


Nit: I don't find PrivacySetting very clear; what about something like PrivacyAwareS3Storage or AccessControlledS3Storage

Ah, naming things!! I like the 2nd suggestion. Let's go with that

Resolved in 60fdb15

…DATE_MAX_WORKERS

…t the necessary methods

…ment a version of FileSystemStorage to use in local dev (and by extension, tests)

…hanges

…pts command

…provided

…ernate in-memory storages to reduce need for mocking in tests

…eMixin.get_privacy_controlled_serve_urls() to return an empty iterable

* main: Wire in the quote block (#59) Introduce Megalinter (#40) Add CI job to add coverage to GH Action summary (#58) # Conflicts: # docs/index.md # poetry.lock

ababic requested a review from a team as a code owner November 29, 2024 13:52

ababic marked this pull request as draft November 29, 2024 13:52

ababic force-pushed the feature/private-media branch 21 times, most recently from 3ba9bfb to b856066 Compare December 2, 2024 17:11

ababic marked this pull request as ready for review December 2, 2024 17:16

ababic force-pushed the feature/private-media branch from b856066 to 9a52eb5 Compare December 3, 2024 07:14

ababic added 5 commits December 3, 2024 07:17

Implement simplified private_media app

6fb93bd

Add to image and document models

9f912b9

Enable custom image serve view

8aad840

Fix unrelated linter complain on user model

b30bd3b

Purge serve and file URLs when the privacy of an object changes

4dd76f3

ababic added 2 commits December 10, 2024 16:47

Add notes on how private and public media items behave

fe260b8

Remove excess whitespace

593af11

nehakerung requested changes Dec 11, 2024

View reviewed changes

docs/custom-features/media_privacy.md Show resolved Hide resolved

Merge branch 'main' into feature/private-media

3a66cd0

nehakerung approved these changes Dec 11, 2024

View reviewed changes

MebinAbraham reviewed Dec 11, 2024

View reviewed changes

ababic added 8 commits December 13, 2024 12:26

PRIVATE_MEDIA_PERMISSION_SETTING_MAX_WORKERS -> PRIVATE_MEDIA_BULK_UP…

ce09c50

…DATE_MAX_WORKERS

Raise an exception when the active storage backends does not implemen…

16a05fb

…t the necessary methods

Rename PrivacySettingS3Storage to AccessControlledS3Storage and imple…

60fdb15

…ment a version of FileSystemStorage to use in local dev (and by extension, tests)

Update tests to account for logging location and storage class name c…

2f5dd30

…hanges

intended_privacy -> privacy

dee1193

Improve readability in PrivateMediaMixin.save()

50f131d

Raise storage exceptions outside of threadpool excecution

f3af123

Placate linters

b581823

ababic force-pushed the feature/private-media branch from 74dae78 to b581823 Compare December 13, 2024 13:15

ababic added 2 commits December 13, 2024 13:23

Return early to reduce indentation in retry_file_permission_set_attem…

7f0de81

…pts command

Use abstract Collection type in update_file_permissions() annotation

5198943

ababic force-pushed the feature/private-media branch from 19376c0 to 5198943 Compare December 13, 2024 13:23

Placate mypy

89a670f

ababic marked this pull request as draft December 13, 2024 14:02

ababic added 2 commits December 13, 2024 15:24

Fix bulk_operations not setting file permissions when a generator is …

d5f2ecc

…provided

Simplify logging in AccessControlLoggingFileSystemStorage and use alt…

faefcac

…ernate in-memory storages to reduce need for mocking in tests

ababic force-pushed the feature/private-media branch from 021a9dd to faefcac Compare December 13, 2024 15:37

ababic added 4 commits December 13, 2024 15:40

Update docs to reflect logging format changes

0bf2898

Revert: Don't cache ServeImage view responses, and update PrivateImag…

c196dee

…eMixin.get_privacy_controlled_serve_urls() to return an empty iterable

Simplify PrivateImageMixin.get_privacy_controlled_serve_urls()

e164276

Merge branch 'main' into feature/private-media

aaea1e9

* main: Wire in the quote block (#59) Introduce Megalinter (#40) Add CI job to add coverage to GH Action summary (#58) # Conflicts: # docs/index.md # poetry.lock

ababic marked this pull request as ready for review December 13, 2024 15:52

ababic and others added 2 commits December 13, 2024 15:52

Merge branch 'main' into feature/private-media

cfe3672

Fix rendition purge urls test

6284741

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restrict media privacy until a referencing page is published #46

Restrict media privacy until a referencing page is published #46

ababic commented Nov 29, 2024 •

edited

Loading

nehakerung left a comment

MebinAbraham left a comment

MebinAbraham Dec 10, 2024

ababic Dec 13, 2024

MebinAbraham Dec 10, 2024 •

edited

Loading

ababic Dec 12, 2024 •

edited

Loading

ababic Dec 13, 2024

MebinAbraham Dec 10, 2024 •

edited

Loading

ababic Dec 13, 2024

MebinAbraham Dec 10, 2024

ababic Dec 12, 2024

ababic Dec 13, 2024

MebinAbraham Dec 10, 2024

ababic Dec 12, 2024 •

edited

Loading

ababic Dec 13, 2024

MebinAbraham Dec 11, 2024

ababic Dec 12, 2024

ababic Dec 13, 2024

MebinAbraham Dec 11, 2024

ababic Dec 12, 2024 •

edited

Loading

ababic Dec 13, 2024

MebinAbraham Dec 11, 2024

ababic Dec 12, 2024

ababic Dec 12, 2024 •

edited

Loading

MebinAbraham Dec 11, 2024

ababic Dec 12, 2024 •

edited

Loading

MebinAbraham Dec 11, 2024

ababic Dec 12, 2024

ababic Dec 13, 2024

		logger = logging.getLogger(__name__)


		class PrivacySettingS3Storage(S3Storage):

Restrict media privacy until a referencing page is published #46

Are you sure you want to change the base?

Restrict media privacy until a referencing page is published #46

Conversation

ababic commented Nov 29, 2024 • edited Loading

What is the context of this PR?

A brief summary:

When a media item's privacy changes:

How to review

Testing media privacy for a draft page

Testing media privacy after a page has been published

Testing image privacy once a draft page has been published

Follow-up Actions

nehakerung left a comment

Choose a reason for hiding this comment

MebinAbraham left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MebinAbraham Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

ababic Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MebinAbraham Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ababic Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ababic Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ababic Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ababic Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ababic commented Nov 29, 2024 •

edited

Loading

MebinAbraham Dec 10, 2024 •

edited

Loading

ababic Dec 12, 2024 •

edited

Loading

MebinAbraham Dec 10, 2024 •

edited

Loading

ababic Dec 12, 2024 •

edited

Loading

ababic Dec 12, 2024 •

edited

Loading

ababic Dec 12, 2024 •

edited

Loading

ababic Dec 12, 2024 •

edited

Loading