Manifest filename escaping #46
Comments
Thanks for the report. This is a pretty busy week for me, but I had a couple of initial thoughts. Clearly we need to add some additional tests to https://github.com/LibraryOfCongress/bagit-conformance-suite/! One thing I was wondering is how many of these tools promise support for BagIt 1.0, and thus whether this might simply be treated as part of bringing an existing (likely 0.97-focused) tool into full support for the spec. I definitely would like to make a 1.0 update to bagit-python. That language was introduced between 0.97 and the first 1.0 drafts, and I'm not sure which came first: this portion of the payload manifest and the same portion of the |
It appears the encoding was introduced here as a response to newline characters appearing in some file names, a case the 0.97 spec did not support. However, the draft language did not also include encoding All of the implementations that I opened issues for claim to support 1.0. With the exception of bagit-python, I did not open issues against 0.97 implementations. If there ever is a next version of the BagIt spec, I think it would be nice if the escaping was handled the same way as checksum utilities. Compatibility with them is enormously useful. |
I might also point out that if implementations percent-decode by only decoding the I don't think I have ever used a percent-encoding library that allows you to control the characters that are decoded, so doing this will likely require a custom implementation or a series of string search and replaces. |
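To illustrate the kind of custom decoder described above, here is a minimal Python sketch. It assumes the only escapes that should be decoded are `%0A`, `%0D`, and `%25`; the function name `decode_manifest_path` is hypothetical and not taken from any existing implementation. A single regex pass is used instead of chained search-and-replace calls so that a sequence like `%250A` decodes to the literal text `%0A` rather than to a newline.

```python
import re

# Assumption: only these three escapes are meaningful in a manifest path.
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}
_PATTERN = re.compile(r"%(?:0A|0D|25)", re.IGNORECASE)


def decode_manifest_path(encoded: str) -> str:
    """Decode only %0A, %0D and %25; leave every other %XX sequence untouched."""
    # One left-to-right pass: text produced by a substitution is never rescanned,
    # so "%250A" becomes the literal "%0A" instead of a newline.
    return _PATTERN.sub(lambda m: _ESCAPES[m.group(0).upper()], encoded)
```

A general-purpose decoder such as `urllib.parse.unquote` would also turn `%20` into a space, which is why it is not used here.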
See issue in my repo for response, in line with these remarks... |
Link to that comment: richardrodgers/bagit#33 (comment) |
After spending some time discussing this with some coworkers, I see that I did a poor job succinctly stating a desired change to the spec. The existing language:
Should be changed to something like:
This would make the BagIt format compatible with unix checksum utilities. |
In regards to writing file paths in manifest files, the spec states the following:
My reading of the intent of the spec is for the manifest files to be usable by Unix checksum utilities. However, this percent-encoding requirement breaks compatibility. While `CR` and `LF` are rare to find in a file path, this encoding requirement becomes a problem because it necessitates the encoding of `%` too. It is fairly common to percent-encode a file name if you're worried about special characters. Per spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the file unusable by checksum utilities.

I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single implementation that implements this requirement correctly. Implementations either 1) do no encoding or 2) only encode `CR` and `LF` and do not encode `%`. The first behavior is broken for file names that contain an `LF` or `CR`, and the second behavior is broken for file names that are naturally percent-encoded. Both are also incompatible with an implementation that actually follows the spec.

I am currently working on yet another implementation and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable by any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is the more prevalent behavior and less broken than doing the partial encoding.
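As a rough illustration of the two encoding behaviors discussed above, the following Python sketch contrasts encoding `CR`, `LF`, and `%` (my reading of the spec) with the partial encoding of only `CR` and `LF`. The function names are hypothetical and not taken from any existing implementation.

```python
def encode_per_spec(path: str) -> str:
    """Percent-encode '%', LF and CR, as the spec language is read in this issue."""
    # '%' goes first so the escapes introduced for LF and CR are not re-encoded.
    return path.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")


def encode_partial(path: str) -> str:
    """The partial behavior: encode LF and CR only, leaving '%' alone."""
    return path.replace("\n", "%0A").replace("\r", "%0D")


# A file name that is already percent-encoded gets double-encoded per spec.
double_encoded = encode_per_spec("my%20file.txt")

# Partial encoding is ambiguous: two different file names yield the same entry.
ambiguous = encode_partial("a\nb") == encode_partial("a%0Ab")
```

Here `double_encoded` is `my%2520file.txt`, which no longer matches the name on disk, and `ambiguous` is `True`, which shows why a reader cannot reliably decode partially encoded entries.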
Unix checksum utilities use an entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as `\n` and a `\` is added to the beginning of the line. Additionally, literal `\` characters are escaped with another `\`, and a `\` is also added to the beginning of the line.

For example, let's say that we have a file named `new\nline` (important: this must be an actual newline and not the characters `\` and `n`) and one named `back\slash`, and then executed the following:

This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.
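The escaping rules described above could be sketched in Python roughly as follows. This is only an approximation of the checksum-utility behavior as described in this issue, not a reference implementation; the function name `format_checksum_line` is hypothetical, the digest values in the usage below are placeholders rather than real checksums, and the two-space separator between digest and name is an assumption about the output format.

```python
def format_checksum_line(digest: str, path: str) -> str:
    """Format one manifest line using backslash escaping for '\\' and newline."""
    # Escape literal backslashes first so the '\n' escape is not doubled.
    escaped = path.replace("\\", "\\\\").replace("\n", "\\n")
    # A leading backslash flags that the file name on this line was escaped.
    prefix = "\\" if escaped != path else ""
    return f"{prefix}{digest}  {escaped}"


# Placeholder digests, used only to show the line layout.
print(format_checksum_line("<digest-1>", "new\nline"))
print(format_checksum_line("<digest-2>", "back\\slash"))
print(format_checksum_line("<digest-3>", "plain.txt"))
```

Because `%` is not special under this scheme, a file name that is already percent-encoded would be written to the manifest unchanged.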