Manifest filename escaping #46

pwinckles · 2022-02-13T17:00:35Z

In regards to writing file paths in manifest files, the spec states the following:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].

My reading of the intent of the spec is for the manifest files to be usable by unix checksum utilities. However, this percent-encoding requirement breaks compatibility. While CR and LF are rare to find in a file path, this encoding requirement becomes a problem because it necessitates the encoding of % too. It is fairly common to percent-encode a file name if you're worried about special characters. Per spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the file unusable by checksum utilities.

I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single implementation that implements this requirement correctly. Implementations either 1) do no encoding or 2) only encode CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR and the second behavior is broken for file names that are naturally percent-encoded. And they're both broken for an actual implementation of the spec.

I am currently working on yet another implementation and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is the more prevalent and less broken than doing the partial encoding.

Unix checksum utilities use a entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \ and a \ is also added to the beginning of the line.

For example, let's say that we have the file named new\nline (important, this must be an actual newline and not the characters \ and n) and one named back\slash, and then executed the following:

# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.

The text was updated successfully, but these errors were encountered:

acdha · 2022-02-16T01:08:36Z

Thanks for the report — this is a pretty busy week for me but I had a couple of initial thoughts. Clearly we need to add some additional tests to https://github.com/LibraryOfCongress/bagit-conformance-suite/!

One thing I was wondering is how many of these tools promise support for BagIt 1.0, and thus whether this might simply be treated as part of bringing an existing (likely 0.97-focused) tool into full support for the spec. I definitely would like to make a 1.0 update to bagit-python.

That language was introduced between 0.97 and the first 1.0 drafts, and I'm not sure whether which came first: this portion of the payload manifest and the same portion of the fetch.txt description which obviously is very URL-focused. I'd have to check the old discussion but I believe the intention was to avoid ambiguity by making it always be safe to run filenames through a URL decoding function even in cases where the original filename was itself URL-encoded (which is not uncommon in certain communities). Since most of the common escaping conventions are valid characters in filenames on at least some operating system and filesystem combinations (e.g. new\nline is a valid filename), I'm not sure there's a better option than working with each of the implementers to cover this case.

pwinckles · 2022-02-16T02:18:16Z

It appears the encoding was introduced here as a response to newline characters appearing in some file names, a case the 0.97 spec did not support. However, the draft language did not also include encoding %, which I think is where the confusion originated. Encoding the % is essential otherwise you would not be able to distinguish between the distinct files new\nline.txt and new%0Aline.txt.

All of the implementations that I opened issues for claim to support 1.0. With the exception of bagit-python, I did not open issues against 0.97 implementations.

If there ever is a next version of the BagIt spec, I think it would be nice if the escaping was handled the same way as checksum utilities. Compatibility with them is enormously useful.

pwinckles · 2022-02-16T10:37:44Z

I might also point out that if implementations percent-decode by only decoding the CR, LF, and ℅ characters then they will remain mostly compatible with incorrect 1.0 implementations. Normally, when you percent-decode you decode any encoded character as described here. However, by not decoding every encoded character a correct 1.0 implementation could validate an unescaped path like test℅201.txt from a current implementation.

I don't think I have ever used a percent-encoding library that allows you to control the characters that are decoded, so doing this will likely require a custom implementation or a series of string search and replaces.

richardrodgers · 2022-02-16T16:15:58Z

See issue in my repo for response, in line with these remarks...

acdha · 2022-02-16T16:24:03Z

Link to that comment: richardrodgers/bagit#33 (comment)

pwinckles · 2022-03-04T20:26:06Z

After spending some time discussing this with some coworkers, I see that I did a poor job succinctly stating a desired change to the spec.

The existing language:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].

Should be changed to something like:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a backslash (\), then those characters MUST be replaced with the literal strings \n, \r, \r\n, and \\ respectively. Additionally, if any characters are replaced in a path, then the manifest entry line MUST be prefixed with a backslash (\). For example: \d8e8fca2dc0f896fd7cb4cb0031ba249 file-with\nnewline

This would make the BagIt format compatible with unix checksum utilities.

This was referenced Feb 15, 2022

Manifest filename escaping APTrust/dart#509

Open

Manifest filename escaping APTrust/preservation-services#54

Open

gwiedeman mentioned this issue Mar 11, 2022

Bagit spec filepath issue and bagit-python version conflict UAlbanyArchives/mailbagit#106

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Manifest filename escaping #46

Manifest filename escaping #46

pwinckles commented Feb 13, 2022 •

edited

Loading

acdha commented Feb 16, 2022

pwinckles commented Feb 16, 2022 •

edited

Loading

pwinckles commented Feb 16, 2022

richardrodgers commented Feb 16, 2022

acdha commented Feb 16, 2022

pwinckles commented Mar 4, 2022

Manifest filename escaping #46

Manifest filename escaping #46

Comments

pwinckles commented Feb 13, 2022 • edited Loading

acdha commented Feb 16, 2022

pwinckles commented Feb 16, 2022 • edited Loading

pwinckles commented Feb 16, 2022

richardrodgers commented Feb 16, 2022

acdha commented Feb 16, 2022

pwinckles commented Mar 4, 2022

pwinckles commented Feb 13, 2022 •

edited

Loading

pwinckles commented Feb 16, 2022 •

edited

Loading