Path encoding bug #33

pwinckles · 2022-02-15T17:42:01Z

I recently discovered that the BagIt 1.0 specification requires that CR, LF, and % in file paths within manifest files are percent-encoded, and that there isn't a single BagIt implementation that does this correctly. Implementations either only encode CR and LF but not % or they encode nothing.

This implementation does not encode paths in the manifest, which means that it would fail to validate BagIt 1.0 bags that include file paths containing CR, LF, or %. Likewise, it would create bags that would fail BagIt 1.0 validation in the case that there are paths that naturally contain percent-encoded characters.

For example, let's say a bag contains the file data/file%0A1.txt. This file should be written to the manifest per the spec as data/file%250A1.txt. However, this implementation writes it as data/file%0A1.txt. This means, that when this implementation validates a properly constructed 1.0 bag it will look for the file data/file%250A1.txt which does not exist. Similarly, if another implementation that follows the spec attempts to validate a bag produced by this implementation, it would look for data/file\n1.txt, which does not exist.

It would seem desirable to me to move the ecosystem in the direction of properly implementing the 1.0 specification, while at the same acknowledging that there are a large number of 1.0 bags in existence that may then become invalid.

As such, it may be prudent to, when validating bags, fall back on a series of tests. You may want to first attempt to validate per the spec, and then, if a file cannot be found, attempt to locate it by either only decoding the CR and LF or leaving the path unchanged, ideally validating all of the files using the same method.

I have not examined fetch.txt implementations, but the same encoding requirements exist for paths in that file as well. This is potentially a thornier problem to address in a backward compatible way as it is unclear if the path data/file%250A1.txt is supposed to create data/file%250A1.txt (incorrect) or data/file%0A1.txt (correct).

Finally, I created a related ticket against the spec discussing this encoding problem, in particular how it breaks checksum utility compatibility.

The text was updated successfully, but these errors were encountered:

pwinckles · 2022-02-16T13:35:23Z

It occurred to me that another approach to backwards compatible validation would be to only decode %0D, %0A, and %25 when decoding paths in manifests. Normally, when percent-decoding, you'd decode all encoded characters as described here. However, by only decoding these three a correct BagIt 1.0 implementation would still be able to validate most bags produced by existing implementations.

For example, if a bag contains the file test%201.txt, then an existing implementation would write it to the manifest as data/test%201.txt when it should actually be data/test%25201.txt. However, if you only decode %0D, %0A, and %25, then the paths are equivalent.

This approach does not work for files that naturally include these three strings. For example, if a bag has a file named test%251.txt. Existing implementations would write it to the manifest as data/test%251.txt when it should be data/test%25251.txt. These paths are not equivalent. The first decodes to data/test%1.txt and the second decodes to data/test%251.txt.

While not perfect, I think this approach would greatly improve validation compatibility.

richardrodgers · 2022-02-16T16:13:10Z

Thanks for bringing this to my attention - I must have missed the 0.97 to 1.0 spec change in this regard. I'll release a corrected/spec-compliant library shortly, with a version bump to signal that it will produce different results. I am inclined to follow your last suggestion (not general percent encoding/decoding - but scoped only to %0D, %0A, and %25). I do not dispute your observations about lack of harmony/compatibility with checksumming tools, or inconsistent validation behavior with prior labeled 1.0 implementations (including mine).

However, I also think that one should not selectively determine which parts of a published spec to implement, or require that implementations must gracefully or exhaustively handle data produced from previous versions. This implementation has a 'software agent' version string (embedded in the bagit package), so in theory a user could segregate bags created by the previous one and handle the validation differently.

Agreed imperfect, but I'd wait for a new spec to take steps beyond these...

Percent encodes file pathnames in manifests. Closes #33

pwinckles · 2022-02-17T16:30:27Z

@richardrodgers I think you'll want to do the %25 substitution on decode last otherwise you'll mistakenly decode a string like new%250Aline to new\nline instead of new%0Aline.

richardrodgers · 2022-02-17T21:57:45Z

Good catch - thanks

…

On Thu, Feb 17, 2022 at 11:30 AM Peter Winckles ***@***.***> wrote: @richardrodgers <https://github.com/richardrodgers> I think you'll want to do the %25 substitution on decode last otherwise you'll mistakenly decode a string like new%250Aline to new\nline instead of new%0Aline. — Reply to this email directly, view it on GitHub <#33 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKSHJMGDWKLK7B7NMOJ3ZDU3UPC7ANCNFSM5OPHTP7Q> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were mentioned.Message ID: ***@***.***>

acdha mentioned this issue Feb 16, 2022

Manifest filename escaping LibraryOfCongress/bagit-spec#46

Open

richardrodgers closed this as completed in 68d40ea Feb 17, 2022

richardrodgers added a commit that referenced this issue Feb 17, 2022

Merge pull request #37 from richardrodgers/manpath_encoding

17b4b58

Percent encodes file pathnames in manifests. Closes #33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Path encoding bug #33

Path encoding bug #33

pwinckles commented Feb 15, 2022

pwinckles commented Feb 16, 2022

richardrodgers commented Feb 16, 2022

pwinckles commented Feb 17, 2022

richardrodgers commented Feb 17, 2022 via email

Path encoding bug #33

Path encoding bug #33

Comments

pwinckles commented Feb 15, 2022

pwinckles commented Feb 16, 2022

richardrodgers commented Feb 16, 2022

pwinckles commented Feb 17, 2022

richardrodgers commented Feb 17, 2022 via email