Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict media privacy until a referencing page is published #46

Open
wants to merge 62 commits into
base: main
Choose a base branch
from

Conversation

ababic
Copy link

@ababic ababic commented Nov 29, 2024

What is the context of this PR?

This is a follow-up to #42 that focusses on addressing solely public/private status of media, without attempting to tie objects to pages to aid with permission policy customisation.

It's fully functional and has lots of tests (which are worth checking out to get a feel for the intended behaviour).

A brief summary:

  • All media should be 'private' by default (regardless of upload route)
  • Media is only made public when used within the content field of a page, and the revision featuring the media is published.
  • If the referencing page is unpublished, any media referenced solely by that page is made private again (anything referenced by other 'live' pages remains public)

When a media item's privacy changes:

  • Where the storage backend supports it (S3), attempts are made to set file-level permissions to reflect the new privacy value.
  • Where permissions for all a media item's file are updated successfully, the file_permissions_last_set timestamp is updated for the media item, and the object's file_permissions_are_outdated() method will return False.
  • If any errors occurred during the update process, file_permissions_last_set is not updated, and the file_permissions_are_outdated() method will return True.
  • Where front-end cache invalidation is configured, purge requests will also be triggered for all relevant 'serve' or direct file URLs

How to review

Testing media privacy for a draft page

  1. In Wagtail, create a new Information Page
  2. Enter "Media upload test" for the title and "Test" for the summary
  3. Add an "Image" block to the content, and use the chooser modal to upload a new image
  4. Add a "Documents" block to the content, use the sub-block options to add a document, and use the chooser modal for that item to upload a new document
  5. Open the preview pane for the page and inspect the HTML where the blocks are rendered.
  6. The src value of the image rendition should look something like /images/{secure key}/2/original/{image filename}, which means it's being served by the 'serve' view.
  7. The href value of the document link should look something like /documents/
  8. Open the document link URL and image src URL in new tabs (they should still work).
  9. Sign out of Wagtail.
  10. Refresh the the tabs opened in step 9 - You should now get a "403 Forbidden" response for both.
  11. In a new tab, Sign back in to Wagtail.
  12. Refresh the tabs once more, and they should work again

Testing media privacy after a page has been published

  1. Publish the "Media upload test" page created in the above steps
  2. In the green success message on the next page, click the "View live" button to view the live/public-facing version of the page.
  3. Inspect the HTML for the image. The src value should have changed to something like: media/images/{image filename}.fill-446x390.format-webp-.original.png.
  4. Access Wagtail again edit the "Media upload test" page once more.
  5. Open the preview pane and inspect the HTML for the image there too. The src value should look the same as it did on the live version.
  6. At the bottom of the page, click the arrow next to the "Save draft" button to view the additional options, and click "Unpublish page".
  7. On the confirmation screen, click "Yes, unpublish it" to confirm the action.
  8. Edit the page again in Wagtail and look again at the preview pane.
  9. Inspect the HTML for the image once more. The src value should now again look like: /images/{secure key}/2/original/{image filename}.

Testing image privacy once a draft page has been published

  1. In Wagtail, edit the draft "Image upload test" page again, and publish it.
  2. In the green success message on the next page, click the "View live" button to view the live/public-facing version of the page.
  3. Inspect the HTML for the image. The src value should look something like: media/images/{image filename}.fill-446x390.format-webp-.original.png.
  4. Access Wagtail again edit the "Image upload test" page once more.
  5. Open the preview pane and inspect the HTML for the image there too. The src value should look the same as it did on the live version.
  6. At the bottom of the page, click the arrow next to the "Save draft" button to view the additional options, and click "Unpublish page".
  7. On the confirmation screen, click "Yes, unpublish it" to confirm the action.
  8. Edit the page again in Wagtail and look again at the preview pane.
  9. Inspect the HTML for the image once more. The src value should now again look like: /images/{secure key}/2/original/{image filename}.

Follow-up Actions

  • Update documentation
  • Ensure PRIVATE_MEDIA_PERMISSION_SETTING_MAX_WORKERS is set appropriately for all target environments
  • Ensure python manage.py retry_file_permission_set_attempts is set to run on a cron every 10 minutes in all target environments

@ababic ababic requested a review from a team as a code owner November 29, 2024 13:52
@ababic ababic marked this pull request as draft November 29, 2024 13:52
@ababic ababic force-pushed the feature/private-media branch 21 times, most recently from 3ba9bfb to b856066 Compare December 2, 2024 17:11
@ababic ababic marked this pull request as ready for review December 2, 2024 17:16
@ababic ababic force-pushed the feature/private-media branch from b856066 to 9a52eb5 Compare December 3, 2024 07:14
Copy link
Contributor

@nehakerung nehakerung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - The doc was simple to understand and follow. I also tested as how to review suggested.

Everything was as said but for Testing media privacy after a page has been published,
media/images/{image filename}.fill-446x390.format-webp-.original.png showed as media/images/{image filename}.original.png instead. I presume this is because I didn't set any fill/format for my images so I don't believe this to be important.

Just requesting an update on hyperlink :D

docs/custom-features/media_privacy.md Show resolved Hide resolved
Copy link
Contributor

@MebinAbraham MebinAbraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

I'm happy for it to be separate, but I suggest we disable the Media library from Admin. It will add an unnecessary layer of complexity.

results[file] = handler(file)

with concurrent.futures.ThreadPoolExecutor(
max_workers=int(settings.PRIVATE_MEDIA_PERMISSION_SETTING_MAX_WORKERS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, could be something shorter like: MEDIA_PERMISSION_MAX_WORKERS

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to stick with the PRIVATE_MEDIA_ prefix, as it makes it easier to mentally map settings to apps in the project. But, I do find the current name a little confusing, so have changed to PRIVATE_MEDIA_BULK_UPDATE_MAX_WORKERS in ce09c50.

Whilst it's still a bit lengthy, I think clarity is more important when it comes to settings.

Comment on lines 36 to 41
if intended_privacy is Privacy.PRIVATE:
handler = getattr(storage, "make_private", None)
elif intended_privacy is Privacy.PUBLIC:
handler = getattr(storage, "make_public", None)

if handler is None:
Copy link
Contributor

@MebinAbraham MebinAbraham Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For private media, I would like to enforce the fact that the storage backend must have these handlers. For local dev, the handles can "log out the action". I would rather not have it pass silently if storage doesn't support it; I would like to take the approach of the storage backend must implement it.

Copy link
Author

@ababic ababic Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funnily enough, that's how I had it originally, but I changed my mind at some point. Should be very easy to restore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 16a05fb

)

def handle(self, *args: Any, **options: Any) -> None:
self.dry_run = options["dry_run"] # pylint: disable=attribute-defined-outside-init
Copy link
Contributor

@MebinAbraham MebinAbraham Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, add as class attr like the publish_bundles.py command does instead of the pylint disable?. The key point is consistency.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been resolved

Comment on lines 29 to 45
for model in get_private_media_models():
queryset = model.objects.filter(file_permissions_last_set__lt=F("privacy_last_changed"))
if not self.dry_run and issubclass(model, PrivateImageMixin):
queryset = queryset.prefetch_related("renditions")
permissions_outdated = list(queryset)
self.stdout.write(f"{len(permissions_outdated)} {model.__name__} instances have outdated file permissions.")
if permissions_outdated:
make_private = []
make_public = []
for obj in permissions_outdated:
if obj.privacy is Privacy.PRIVATE:
make_private.append(obj)
elif obj.privacy is Privacy.PUBLIC:
make_public.append(obj)

self.update_file_permissions(model, make_private, Privacy.PRIVATE)
self.update_file_permissions(model, make_public, Privacy.PUBLIC)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I think it is easier to read if we added some newlines and exit early to reduce indentation. It feels a bit crowded.

        for model in get_private_media_models():
            queryset = model.objects.filter(file_permissions_last_set__lt=F("privacy_last_changed"))
            if not self.dry_run and issubclass(model, PrivateImageMixin):
                queryset = queryset.prefetch_related("renditions")
            
            permissions_outdated = list(queryset)
            self.stdout.write(f"{len(permissions_outdated)} {model.__name__} instances have outdated file permissions.")
            
            if not permissions_outdated:
                return None
            
            make_private = []
            make_public = []
            for obj in permissions_outdated:
                if obj.privacy is Privacy.PRIVATE:
                    make_private.append(obj)
                elif obj.privacy is Privacy.PUBLIC:
                    make_public.append(obj)

            self.update_file_permissions(model, make_private, Privacy.PRIVATE)
            self.update_file_permissions(model, make_public, Privacy.PUBLIC)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funnily enough, that's how I had it originally, but it was before I broke handling off to update_file_permissions, and pylint had a problem with the number of return statements. It should be okay to do that again now that update_file_permissions exists

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 7f0de81

self.update_file_permissions(model, make_public, Privacy.PUBLIC)

def update_file_permissions(
self, model_class: type["PrivateMediaMixin"], items: list["PrivateMediaMixin"], privacy: Privacy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotation for args should be abstract unless the method needs it to be a list.

Copy link
Author

@ababic ababic Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree where it comes to reusable code, but in the case of management commands (which are self-contained, and only really reusable if you're defining a base class for other commands to inherit), I think being explicit adds value. Less... 'how you might use me' and more 'this is exactly how i'm being used'.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 5198943

.values_list("int_object_id", flat=True)
.distinct()
)
queryset = model_class.objects.filter(pk__in=referenced_pks, _privacy=Privacy.PRIVATE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If referenced_pks is empty, continue to the next model instead of eventually coming out of the loop?.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue there is that checking the result at this point will evaluate the query (a separate database round trip), currently, all of the subqueries remain lazy until the final queryset is evaluated, and postgres handles it in one round-trip).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest avoid this change based on the above reasoning. @MebinAbraham are you happy with that?

Comment on lines +60 to +94
# Identify references from other pages and extract their IDs, so that we can
# figure out which of those pages is live
referencing_page_ids = (
ReferenceIndex.objects.filter(to_content_type=model_ct, to_object_id__in=referenced_pks)
.exclude(pk__in=references.values_list("pk", flat=True))
.filter(base_content_type=page_ct)
.annotate(page_id=Cast("object_id", output_field=IntegerField()))
.values_list("page_id", flat=True)
.distinct()
)

# Out of the pages that are referencing the media, identify the ids of those
# that are currently live
live_page_ids = (
Page.objects.filter(pk__in=referencing_page_ids)
.live()
.annotate(str_id=Cast("id", output_field=CharField()))
.values_list("str_id", flat=True)
)

# Now we can identify references from live pages only, and
# generate a list of media item ids that should not be made private
live_page_referenced_media_pks = (
ReferenceIndex.objects.filter(to_content_type=model_ct, to_object_id__in=referenced_pks)
.filter(base_content_type=page_ct, object_id__in=live_page_ids)
.annotate(int_id=Cast("to_object_id", output_field=IntegerField()))
.values_list("int_id", flat=True)
.distinct()
)

queryset = (
model_class.objects.all()
.filter(pk__in=[int(pk) for pk in referenced_pks], _privacy=Privacy.PUBLIC)
.exclude(pk__in=live_page_referenced_media_pks)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if they will be very performant in the long term. One to revisit, as this might be quite large in the future. Unless there is a user need, it might be better to treat images as unique, so accept additional storage, meaning we can replace this involved check with a more simple one; worst case, it should only allow reusability within the same model, not across models.

Copy link
Author

@ababic ababic Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been refactored so that the potential list of pages only includes those that reference the same media referenced by the current object. If, like you say, that case is rare, it should be very small indeed.

Its always better for important code like this to assume as little as possible about what other things might be doing, so even if we did manage to somehow restrict media use to a single object (I currently have no idea how that could be achieved nicely), I'd still very much recommend keeping this.

Where the queryset result isn't explicitly evaluated to a list/set beforehand, Django usually keeps everything lazy and has the database evaluate everything in one go, so this is really quite performant (there should be some assertNumQueries items in the tests to confirm this). So, it isn't that bad at all performance-wise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MebinAbraham Happy to mark this as resolved based on the above explaination?

Comment on lines 20 to 29
if image.is_public and not image.file_permissions_are_outdated():
return redirect(rendition.file.url)

# Block access to private images if the user has insufficient permissions
user = self.request.user
if image.is_private and (
not user.is_authenticated
or not permission_policy.user_has_any_permission_for_instance(user, ["choose", "add", "change"], image)
):
raise PermissionDenied
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to ensure that the file can still be served if the image is private but has public pages referencing it. For example, if a page is published but the ACLs are not properly set due to network issues etc, the application should still serve the image until the next scheduled task attempts to fix and retry the ACL settings. We cannot have a page that features broken assets. The same principle applies to documents.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is covered by signal handlers. If a page features any media that is private, it is made public when those changes are published, and whilst the page remains published, media referenced by it will always remain public.

Also, the _privacy field should always be the canonical indicator of privacy for a media item. If we're having to make additional queries at the time of serving to figure out if that is really the case, then we've failed terribly.

Copy link
Author

@ababic ababic Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a page is published but the ACLs are not properly set due to network issues etc, the application should still serve the image until the next scheduled task attempts to fix and retry the ACL settings.

This particular case is handled by the last line of the method. The image is 'public' according to the system, so won't be permission-checked. It'll just be served, using the server's AWS creds to stream the file contents.

Comment on lines +14 to +21
def protect_private_documents(document: "PrivateDocumentMixin", request: "HttpRequest") -> None:
"""Block access to private documents if the user has insufficient permissions."""
if document.is_private and (
not request.user.is_authenticated
or not permission_policy.user_has_any_permission_for_instance(
request.user, ["choose", "add", "change"], document
)
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, we need to be able to serve even if it is private just incase ACL set failed.

Copy link
Author

@ababic ababic Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documents are different to images in that they are ALWAYS served by the view, regardless of privacy values. So, in essence, the ACL is irrelevant here, as the storage backend uses its credentials to stream the file from AWS

logger = logging.getLogger(__name__)


class PrivacySettingS3Storage(S3Storage):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't find PrivacySetting very clear; what about something like PrivacyAwareS3Storage or AccessControlledS3Storage

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, naming things!! I like the 2nd suggestion. Let's go with that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 60fdb15

@ababic ababic force-pushed the feature/private-media branch from 74dae78 to b581823 Compare December 13, 2024 13:15
@ababic ababic force-pushed the feature/private-media branch from 19376c0 to 5198943 Compare December 13, 2024 13:23
@ababic ababic marked this pull request as draft December 13, 2024 14:02
@ababic ababic force-pushed the feature/private-media branch from 021a9dd to faefcac Compare December 13, 2024 15:37
…eMixin.get_privacy_controlled_serve_urls() to return an empty iterable
* main:
  Wire in the quote block (#59)
  Introduce Megalinter (#40)
  Add CI job to add coverage to GH Action summary (#58)

# Conflicts:
#	docs/index.md
#	poetry.lock
@ababic ababic marked this pull request as ready for review December 13, 2024 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants