Support AWS S3 access #25

Closed
zimeon opened this issue Jul 21, 2020 · 8 comments

Labels: enhancement (New feature or request)

Comments

zimeon commented Jul 21, 2020

See the pyfilesystem2 branch for work to switch all file access over to PyFilesystem. This should enable the code to work with regular OS filesystems, S3, and zipped filesystems, among others.
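
For anyone following along, here is a minimal sketch (not the ocfl-py code itself) of the idea: with PyFilesystem the same traversal logic can run against an OS directory, a zip file, or an S3 bucket just by changing the filesystem URL. The zip path and bucket name below are placeholders.

```python
# Minimal sketch: the same logic runs against any PyFilesystem backend.
# "fixtures.zip" and "my-bucket" are placeholders, and the s3:// URL needs
# fs-s3fs installed plus AWS credentials available to boto3.
from fs import open_fs

for url in ["osfs://.", "zip://fixtures.zip", "s3://my-bucket"]:
    with open_fs(url) as fsys:
        print(url, fsys.listdir("/"))
```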

zimeon added the enhancement label on Jul 21, 2020

zimeon commented Jul 21, 2020

Have a version running that passes all tests with regular OS filesystem access and with zipped sets of files (useful for test fixtures with empty directories, which git doesn't support).

Found a gotcha with S3 support via S3FS: it assumes that there are "directory objects" to help simulate a filesystem (noted in https://fs-s3fs.readthedocs.io/en/latest/#limitations). However, it would be good to be able to validate OCFL objects and storage roots on S3 that do not include "directory objects". The solution may be to create the S3FS object with strict=False, but that doesn't work with the standard open_fs(...) filesystem opener; see PyFilesystem/s3fs#65 (comment)
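
For reference, a sketch of that workaround: constructing S3FS directly exposes strict, whereas the generic open_fs("s3://...") URL does not. The bucket name and prefix below are placeholders.

```python
# Sketch: constructing S3FS directly allows strict=False, so missing
# "directory objects" are tolerated. Bucket name and dir_path are placeholders;
# credentials come from the usual boto3 lookup.
from fs_s3fs import S3FS

s3_fs = S3FS("my-bucket", dir_path="/ocfl-root", strict=False)
print(s3_fs.listdir("/"))
```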

zimeon commented Jul 23, 2020

Maybe it is possible to pass strict=False into the generic opener by adding a query parameter strict=0 that is parsed and then used in the S3FS version of open_fs: https://github.com/PyFilesystem/s3fs/blob/master/fs_s3fs/opener.py#L23-L27
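
Something along these lines, perhaps. This is purely illustrative and not part of fs-s3fs; the s3+lax scheme and the parameter handling are made up here just to show the shape of the idea.

```python
# Hypothetical sketch: a custom PyFilesystem opener that reads a ?strict=0
# query parameter and passes strict=False through to S3FS. The "s3+lax"
# scheme is invented so the stock s3:// opener is untouched.
from fs.opener import Opener, registry
from fs_s3fs import S3FS


@registry.install
class LaxS3Opener(Opener):
    protocols = ["s3+lax"]

    def open_fs(self, fs_url, parse_result, writeable, create, cwd):
        bucket_name, _, dir_path = parse_result.resource.partition("/")
        strict = parse_result.params.get("strict", "1") != "0"
        # Credentials are left to the default boto3 lookup in this sketch.
        return S3FS(bucket_name, dir_path=dir_path or "/", strict=strict)
```

With that installed, something like open_fs("s3+lax://my-bucket/prefix?strict=0") would hand back a non-strict S3FS.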

zimeon commented Aug 3, 2020

Have merged in the pyfilesystem2 branch as version 1.1.0. It needs some more work to tidy the pyfs code, especially the new version of walk in ocfl/pyfs that avoids using scandir.
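
To illustrate the kind of thing meant here (a sketch only, not the actual ocfl/pyfs code): a walk built from listdir() and isdir() rather than scandir().

```python
# Sketch only, not the actual ocfl/pyfs implementation: a walk built from
# listdir() and isdir() instead of scandir().
from fs import open_fs
from fs.path import join


def simple_walk(fsys, path="/"):
    """Yield (dirpath, dirnames, filenames) tuples, top-down."""
    dirnames, filenames = [], []
    for name in fsys.listdir(path):
        (dirnames if fsys.isdir(join(path, name)) else filenames).append(name)
    yield path, dirnames, filenames
    for name in dirnames:
        yield from simple_walk(fsys, join(path, name))


with open_fs("osfs://.") as fsys:
    for dirpath, dirs, files in simple_walk(fsys):
        print(dirpath, dirs, files)
```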

awoods commented Nov 22, 2024

We are currently designing a process to validate ~11M OCFL objects that are stored behind an S3 interface. It would be ideal if we were able to perform OCFL object-level validation using ocfl-py without first copying the objects to local disk. Assuming access to large-scale memory for this validation process, is it conceivable for ocfl-py to be enhanced to support this use case?

zimeon commented Nov 26, 2024

I think the current version 1.3.0 on PyPI might actually work. For a single object you could try something like:

python ocfl-validate.py s3://bucket/path_to_object_root

I did write a rough version of walk that uses pyfilesystem2, so even a bulk validation might work, though I think trying a run over 11M objects without some means of checkpoint/restart would be a frustrating experience.
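
By way of illustration, a hedged sketch of the checkpoint/restart idea, just driving the ocfl-validate.py command shown above; objects.txt (one object-root URL per line) and checkpoint.txt are made-up bookkeeping files, not part of ocfl-py.

```python
# Sketch of checkpointed bulk validation driving the ocfl-validate.py CLI.
# "objects.txt" and "checkpoint.txt" are assumed bookkeeping files.
import os
import subprocess

done = set()
if os.path.exists("checkpoint.txt"):
    with open("checkpoint.txt") as f:
        done = {line.split("\t")[0] for line in f}

with open("checkpoint.txt", "a") as log, open("objects.txt") as todo:
    for url in (line.strip() for line in todo):
        if not url or url in done:
            continue  # already processed in an earlier run
        result = subprocess.run(["python", "ocfl-validate.py", url])
        log.write("%s\t%d\n" % (url, result.returncode))
        log.flush()
```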

I am in the middle of working on a major refactor for v2 of ocfl-py and hope to ensure that it works on S3. It would be really great if we had an open S3 endpoint with a copy of the fixture objects... I wonder if that is something we should talk about in the editors group?

awoods commented Nov 26, 2024

re: "open S3 endpoint with a copy of the fixture objects"
Although minimal, the first question is "how do we pay for it"?

zimeon commented Nov 26, 2024

After a bit of time messing about, the current dev code correctly validates one object on S3 and appropriately fails another:

ocfl-py> ./ocfl-validate.py s3://ocfl-fixtures/1.1/good-objects/spec-ex-full/
INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
OCFL v1.1 Object at s3://ocfl-fixtures/1.1/good-objects/spec-ex-full/ is VALID

ocfl-py> ./ocfl-validate.py s3://ocfl-fixtures/1.1/bad-objects/E092_content_file_digest_mismatch
INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
[E092a] OCFL Object root inventory manifest using digest algorithm sha512 has digest 24f950aac7b9ea9b3cb728228a0c82b67c39e96b4b344798870d5daee93e3ae5931baae8c7cacfea4b629452c38026a81d138bc7aad1af3ef7bfd5ec646d6c28 for file v1/content/test.txt which doesn't match calculated digest 1277a792c8196a2504007a40f31ed93bf826e71f16273d8503f7d3e46503d00b8d8cda0a59d6a33b9c1aebc84ea6a79f7062ee080f4a9587055a7b6fb92f5fa8 for that file (see https://ocfl.io/1.1/spec/#E092)
OCFL v1.1 Object at s3://ocfl-fixtures/1.1/bad-objects/E092_content_file_digest_mismatch is INVALID

zimeon commented Dec 12, 2024

I'm going to close this in favor of #133 to avoid giving the impression that S3 is not supported.

zimeon closed this as completed on Dec 12, 2024