I recently discovered that the BagIt 1.0 specification requires that CR, LF, and % in file paths within manifest files be percent-encoded, and that there isn't a single BagIt implementation that does this correctly. Implementations either encode only CR and LF but not %, or they encode nothing.
This implementation encodes only CR and LF but not %. This is problematic because it would fail to validate BagIt 1.0 bags that include file paths containing % characters. Likewise, it would create bags that fail BagIt 1.0 validation whenever a path naturally contains a percent-encoded sequence. Furthermore, this implementation does not generate valid BagIt 0.97 bags either: the 0.97 spec does not encode any manifest paths, and this implementation encodes CR and LF.
For example, let's say a bag contains the file data/file%0A1.txt. Per the spec, this file should be written to the manifest as data/file%250A1.txt. However, this implementation writes it as data/file%0A1.txt. This means that when this implementation validates a properly constructed 1.0 bag, it will look for the file data/file%250A1.txt, which does not exist. Similarly, if another implementation that follows the spec attempts to validate a bag produced by this implementation, it would look for data/file\n1.txt, which does not exist.
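To make the example concrete, here is a minimal sketch of the encoding as I read the 1.0 spec. The function names are hypothetical and are not part of bagit-python; it also assumes that hex digits in escapes should be decoded case-insensitively.

```python
import re

# Hypothetical helpers, not bagit-python API. Per my reading of BagIt 1.0,
# only CR, LF, and % are percent-encoded in manifest paths.
_DECODE_TABLE = {"0D": "\r", "0A": "\n", "25": "%"}


def encode_manifest_path(path: str) -> str:
    # Encode % first so the % signs introduced by the CR/LF escapes
    # are not themselves re-encoded.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def decode_manifest_path(encoded: str) -> str:
    # Decode in a single pass so that "%250A" becomes "%0A" and is not
    # decoded a second time into a line feed.
    return re.sub(
        r"%(0[AaDd]|25)",
        lambda m: _DECODE_TABLE[m.group(1).upper()],
        encoded,
    )


print(encode_manifest_path("data/file%0A1.txt"))
print(repr(decode_manifest_path("data/file%0A1.txt")))
```

The second call shows why a bag written without encoding % is misread by a spec-following reader: the literal %0A in the file name is decoded into a line feed.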
It would seem desirable to me to move the ecosystem in the direction of properly implementing the 1.0 specification, while at the same time acknowledging that there are a large number of 1.0 bags in existence that may then become invalid.
As such, when validating bags it may be prudent to fall back on a series of tests. You may want to first attempt to validate per the spec and then, if a file cannot be found, attempt to locate it by either decoding only CR and LF or leaving the path unchanged, ideally validating all of the files in a bag using the same method.
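The fallback could be sketched roughly as follows. This is only an illustration under my own assumptions: the function and strategy names are hypothetical, and it checks file existence only, not checksums.

```python
import os
import re


def _decode_with(table):
    # Build a single-pass decoder for the escapes listed in `table`.
    pattern = re.compile("%(" + "|".join(table) + ")", re.IGNORECASE)
    return lambda p: pattern.sub(lambda m: table[m.group(1).upper()], p)


# Tried in order: spec-compliant decoding first, then the legacy behaviours.
STRATEGIES = [
    ("spec", _decode_with({"0D": "\r", "0A": "\n", "25": "%"})),
    ("crlf-only", _decode_with({"0D": "\r", "0A": "\n"})),
    ("raw", lambda p: p),
]


def choose_decoding(bag_root, manifest_paths):
    """Return the first strategy under which every manifest path exists.

    Returns None if no single strategy locates all of the files.
    """
    for name, decode in STRATEGIES:
        if all(
            os.path.isfile(os.path.join(bag_root, decode(p)))
            for p in manifest_paths
        ):
            return name
    return None
```

Choosing one strategy for the whole bag, rather than per file, avoids accepting a bag whose paths only resolve under a mix of interpretations.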
I have not examined fetch.txt implementations, but the same encoding requirements exist for paths in that file as well. This is potentially a thornier problem to address in a backward-compatible way, as it is unclear whether the path data/file%250A1.txt is supposed to create data/file%250A1.txt (incorrect) or data/file%0A1.txt (correct).
Finally, I created a related ticket against the spec discussing this encoding problem, in particular how it breaks checksum utility compatibility.
I should have customized this ticket more. Unlike the other implementations, bagit-python does not technically support BagIt 1.0. The main problem here is then that it is replacing CR and LF in 0.97 bags, when it should not be, but it does not have a broken 1.0 implementation because there isn't one.