Support AWS S3 access #25

Closed
zimeon opened this issue Jul 21, 2020 · 8 comments

Labels: enhancement (New feature or request)

Comments

zimeon commented Jul 21, 2020

See the pyfilesystem2 branch for work to switch all file access over to PyFilesystem. This should enable the code to work with regular OS filesystems, S3, and zipped filesystems, among others.
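
For anyone following along, here is a minimal sketch (not the ocfl-py code itself) of the idea: with PyFilesystem the same traversal logic can run against an OS directory, a zip file, or an S3 bucket just by changing the filesystem URL. The zip path and bucket name below are placeholders.

```python
# Minimal sketch: the same logic runs against any PyFilesystem backend.
# "fixtures.zip" and "my-bucket" are placeholders, and the s3:// URL needs
# fs-s3fs installed plus AWS credentials available to boto3.
from fs import open_fs

for url in ["osfs://.", "zip://fixtures.zip", "s3://my-bucket"]:
    with open_fs(url) as fsys:
        print(url, fsys.listdir("/"))
```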

zimeon added the enhancement label on Jul 21, 2020

zimeon commented Jul 21, 2020

Have a version running that passes all tests with regular OS filesystem access and with zipped sets of files (useful for test fixtures with empty directories, which git doesn't support).

Found a gotcha with S3 support via S3FS: it assumes that there are "directory objects" to help simulate a filesystem (noted in https://fs-s3fs.readthedocs.io/en/latest/#limitations). However, it would be good to be able to validate OCFL objects and storage roots on S3 that do not include "directory objects". The solution may be to create the S3FS object with strict=False, but that doesn't work with the standard open_fs(...) filesystem opener; see PyFilesystem/s3fs#65 (comment)
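
For reference, a sketch of that workaround: constructing S3FS directly exposes strict, whereas the generic open_fs("s3://...") URL does not. The bucket name and prefix below are placeholders.

```python
# Sketch: constructing S3FS directly allows strict=False, so missing
# "directory objects" are tolerated. Bucket name and dir_path are placeholders;
# credentials come from the usual boto3 lookup.
from fs_s3fs import S3FS

s3_fs = S3FS("my-bucket", dir_path="/ocfl-root", strict=False)
print(s3_fs.listdir("/"))
```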

zimeon commented Jul 23, 2020

Maybe it is possible to pass strict=False into the generic opener by adding a query parameter strict=0 that is parsed and then used in the S3FS version of open_fs: https://github.com/PyFilesystem/s3fs/blob/master/fs_s3fs/opener.py#L23-L27
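
Something along these lines, perhaps. This is purely illustrative and not part of fs-s3fs; the s3+lax scheme and the parameter handling are made up here just to show the shape of the idea.

```python
# Hypothetical sketch: a custom PyFilesystem opener that reads a ?strict=0
# query parameter and passes strict=False through to S3FS. The "s3+lax"
# scheme is invented so the stock s3:// opener is untouched.
from fs.opener import Opener, registry
from fs_s3fs import S3FS


@registry.install
class LaxS3Opener(Opener):
    protocols = ["s3+lax"]

    def open_fs(self, fs_url, parse_result, writeable, create, cwd):
        bucket_name, _, dir_path = parse_result.resource.partition("/")
        strict = parse_result.params.get("strict", "1") != "0"
        # Credentials are left to the default boto3 lookup in this sketch.
        return S3FS(bucket_name, dir_path=dir_path or "/", strict=strict)
```

With that installed, something like open_fs("s3+lax://my-bucket/prefix?strict=0") would hand back a non-strict S3FS.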

zimeon commented Aug 3, 2020

Have merged in the pyfilesystem2 branch as version 1.1.0. It needs some more work to tidy the pyfs code, especially the new version of walk in ocfl/pyfs that avoids using scandir.
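
To illustrate the kind of thing meant here (a sketch only, not the actual ocfl/pyfs code): a walk built from listdir() and isdir() rather than scandir().

```python
# Sketch only, not the actual ocfl/pyfs implementation: a walk built from
# listdir() and isdir() instead of scandir().
from fs import open_fs
from fs.path import join


def simple_walk(fsys, path="/"):
    """Yield (dirpath, dirnames, filenames) tuples, top-down."""
    dirnames, filenames = [], []
    for name in fsys.listdir(path):
        (dirnames if fsys.isdir(join(path, name)) else filenames).append(name)
    yield path, dirnames, filenames
    for name in dirnames:
        yield from simple_walk(fsys, join(path, name))


with open_fs("osfs://.") as fsys:
    for dirpath, dirs, files in simple_walk(fsys):
        print(dirpath, dirs, files)
```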

awoods commented Nov 22, 2024

We are currently designing a process to validate ~11M OCFL objects that are stored behind an S3 interface. It would be ideal if we were able to perform OCFL object-level validation using ocfl-py without first copying the objects to local disk. Assuming access to large-scale memory for this validation process, is it conceivable for ocfl-py to be enhanced to support this use case?

zimeon commented Nov 26, 2024

I think the current version 1.3.0 on PyPI might actually work. For a single object you could try something like:

python ocfl-validate.py s3://bucket/path_to_object_root

I did write a rough version of walk that uses pyfilesystem2, so even a bulk validation might work, though I think trying a run over 11M objects without some means of checkpoint/restart would be a frustrating experience.
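
By way of illustration, a hedged sketch of the checkpoint/restart idea, just driving the ocfl-validate.py command shown above; objects.txt (one object-root URL per line) and checkpoint.txt are made-up bookkeeping files, not part of ocfl-py.

```python
# Sketch of checkpointed bulk validation driving the ocfl-validate.py CLI.
# "objects.txt" and "checkpoint.txt" are assumed bookkeeping files.
import os
import subprocess

done = set()
if os.path.exists("checkpoint.txt"):
    with open("checkpoint.txt") as f:
        done = {line.split("\t")[0] for line in f}

with open("checkpoint.txt", "a") as log, open("objects.txt") as todo:
    for url in (line.strip() for line in todo):
        if not url or url in done:
            continue  # already processed in an earlier run
        result = subprocess.run(["python", "ocfl-validate.py", url])
        log.write("%s\t%d\n" % (url, result.returncode))
        log.flush()
```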

I am in the middle of working on a major refactor for v2 of ocfl-py and hope to ensure that it works on S3. It would be really great if we had an open S3 endpoint with a copy of the fixture objects... I wonder if that is something we should talk about in the editors group?

awoods commented Nov 26, 2024

re: "open S3 endpoint with a copy of the fixture objects"
Although minimal, the first question is "how do we pay for it"?

zimeon commented Nov 26, 2024

After a bit of time messing about, the current dev code correctly validates one object on S3 and appropriately fails another:

ocfl-py> ./ocfl-validate.py s3://ocfl-fixtures/1.1/good-objects/spec-ex-full/
INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
OCFL v1.1 Object at s3://ocfl-fixtures/1.1/good-objects/spec-ex-full/ is VALID

ocfl-py> ./ocfl-validate.py s3://ocfl-fixtures/1.1/bad-objects/E092_content_file_digest_mismatch
INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
[E092a] OCFL Object root inventory manifest using digest algorithm sha512 has digest 24f950aac7b9ea9b3cb728228a0c82b67c39e96b4b344798870d5daee93e3ae5931baae8c7cacfea4b629452c38026a81d138bc7aad1af3ef7bfd5ec646d6c28 for file v1/content/test.txt which doesn't match calculated digest 1277a792c8196a2504007a40f31ed93bf826e71f16273d8503f7d3e46503d00b8d8cda0a59d6a33b9c1aebc84ea6a79f7062ee080f4a9587055a7b6fb92f5fa8 for that file (see https://ocfl.io/1.1/spec/#E092)
OCFL v1.1 Object at s3://ocfl-fixtures/1.1/bad-objects/E092_content_file_digest_mismatch is INVALID

zimeon commented Dec 12, 2024

I'm going to close this in favor of #133 to avoid giving the impression that S3 is not supported.

zimeon closed this as completed on Dec 12, 2024