-
-
Notifications
You must be signed in to change notification settings - Fork 26
Generated ZIP file reuse feature design
⚠️ 🔧 This wiki article contains technical design specifications intended for a technical audience 🔧 ⚠️
If you just are an end user, please consider that you probably are better off reading other documentation that uses simpler terms, unless you really want to read the nitty-gritty details, or are looking for some very specific fact about the topic.
PackSquash v0.3.0 introduced a feature called append mode, previously generated ZIP file reuse or caching, which allows it to detect what files were changed since the last time it was executed on a pack with high accuracy under most circumstances. When this feature is used and a likely unchanged file is detected, all processing is skipped for that pack file, so it is just added and copied to the new version of the output ZIP file. This is much faster than processing it again.
Obviously, for a feature like this to work, it is necessary to store data somewhere to track the file changes in a way that can be efficiently read back and compared with later. So, in addition to the functional requirement stated above, the feature must satisfy the following non-functional requirements:
- It must be at least more time-efficient than processing the pack file again, because otherwise the feature would be useless.
- It must be privacy-first and storage-efficient, saving the lesser amount of data needed for the feature to work. The data must not be readable by any third party under reasonable threat models.
- It must not reduce PackSquash portability, require extra installation steps, or store files in some hardcoded, system or user dependent data store that may make it harder to do reproducible runs or to support use cases which involve running PackSquash by different users, operating systems or machines. In other words, it must be as stateless as possible for the end user.
- It must be a reasonably simple design, as simple to maintain and reason about as possible.
- Lastly, it must be as user-friendly as reasonably possible, considering the previous requirements. It is acceptable to exchange some user-friendliness if that demonstrably helps fulfilling any of the previous requirements, but it is not acceptable to overdo this point.
The design of the current PackSquash implementation for this feature revolves around the following key functional blocks and concepts:
- System identifiers.
- Filesystem-provided file modification timestamps.
- Squash Times.
- Format-preserving encryption (FPE), also known as sanitization in some technical literature.
The following flowchart shows how these items relate to each other in a high-level overview of the pack file processing algorithm, focusing on the aspects relevant to the feature at hand.
This design satisfies the requirements stated above for the following reasons:
- It is trivial to conclude that it allows to skip processing unchanged files.
- Getting the file modification time from the filesystem is an operation that can be done very quickly in most filesystems. This operation might be slower in distributed filesystems, but so might be every operation with them in general. It also can be done efficiently, because file metadata is already retrieved to iterate through the pack directory, so checking the modification time field only adds a marginal time cost. Timestamps are also quick to compare and compute, unlike hashes, which require looking at the file contents.
- Modification timestamps are the bare minimum amount of data needed to check whether a file has plausibly changed or not. Also, we believe in good faith, albeit without formal proof or review by a recognized cryptography expert, that the designed cryptosystem is effective to maintain timestamp and system ID confidentiality under ciphertext-only attacks and chosen-plaintext attacks. This claim is at least plausible due to the guarantees provided by the cryptographic primitives used, which are discussed in NIST Special Publication 800-38G § Appendix B.
- The timestamps are stored in the generated ZIP file using standard fields reserved for this purpose in the ZIP file records, so no additional data stores are required. Transferring the generated ZIP file also transfers the timestamps.
- The design is based on standard cryptographic primitives and simple timestamp comparisons, which are easy to maintain and reason about.
- Because the metadata is kept within the ZIP file, there is little room for user error, providing that the user is aware that some environment changes may render that metadata unable to be read back properly or obsolete. These changes are documented in the options files wiki article. Moreover, this design can work automagically in the most common use cases without any configuration.
The following sections present the information about key design elements that is needed to have a thorough understanding of the design.
System identifiers are unsigned 128-bit integers that are guaranteed to be more repeatable across PackSquash executions than a randomly generated number, fairly unpredictable, and unique. In practice, the way to get a 128-bit integer that is somewhat unique, unpredictable and repeatable is to query the computer system for hardware or software data that is expected to be long-lived, so that is why system identifiers are named as such. The reader may be more familiar with other terms that refer to a similar (albeit not identical) concept, like HWID, hardware ID or machine ID.
The definition above is so generic because the ways to get a system identifier change across operating systems and hardware platforms. They may also change within the same environment (i.e. operating system install and hardware combination) due to user intervention or another usually atypical factors. The user may also spoof a system identifier, but this is not a concern in typical scenarios because it is reasonable to assume that no malicious party is in control of the computer that is used to generate ZIP files, and that the end user has no interest in spoofing identifiers in a way that undermines the suitability of the system identifier for the cryptosystem. Social engineering attacks are excluded from the threat model, because they can compromise every conceivable cryptosystem. In this vein, it should be noted that if any party succeeds in making PackSquash use a weak (i.e. predictable, non-unique) system identifier the security of the whole cryptosystem would be compromised.
Another consequence of the facts stated above is that system identifiers may be of varying suitability for long-term cryptographic key generation. The impact that this consequence has in information security and usability is addressed with the system identifier selection and checking algorithm described in the following steps:
- Every candidate system identifier for the running environment is retrieved.
- The candidate system identifiers are sorted in descending order by their suitability score. The suitability score is the metric Shannon entropy of the system identifier, minus a constant number if the system identifier is believed to be volatile (i.e. that may change without any user intervention that may be reasonably expected to be targeted at changing the identifier). If a tie happens, non-volatile identifiers are considered to have a higher score. If both candidate identifiers have the same volatility, the candidate considered to have a higher score is the one who has a higher priority. The priority is a hardcoded integer that reflects how suitable the PackSquash authors think that the candidate system identifiers provided by some method are. In addition, priority values are chosen so that no two candidate identifiers can share the same priority value.
- The candidate system identifier with the most suitability score is used as the system identifier.
- If the chosen system identifier is volatile or does not meet a hardcoded minimum entropy threshold, warnings are emitted, so the user is notified of the fact that the identifier may not be suitable enough, and can take corrective measures.
On Windows, PackSquash retrieves candidate system identifiers from the following sources:
- The
MachineGuid
registry key located atHKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Cryptography
. - The SMBIOS system UUID, read through the
Win32_ComputerSystemProduct
WMI class, that is provided by the motherboard firmware on x86 hardware platforms. - The last Windows service or feature pack installation date, stored in the registry at
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\InstallDate
.
On Linux, PackSquash retrieves candidate system identifiers from the following sources:
- The D-Bus/systemd machine ID.
- The boot ID provided by the kernel via sysctl.
- The
gethostid()
C function, defined by POSIX.1-2017 as an extension to the ISO C standard.
On macOS, PackSquash retrieves candidate system identifiers from the following sources:
- The
IOPlatformSerialNumber
of theIOPlatformExpertDevice
I/O Kit device. - The
gethostid()
C function, defined by POSIX.1-2017 as an extension to the ISO C standard.
Finally, on every platform, the PACKSQUASH_SYSTEM_ID
environment variable can be set to a RFC4122 UUID string that will be parsed as the system identifier, ignoring any other candidates. This is meant to support use cases where it is necessary to use a fixed system identifier, be it because PackSquash is not capable of getting a suitable one or because it is expected to be used accross environments that may have different identifiers. Suitable values for this environment variable can be generated in this web page, which is provided by the PackSquash authors as an example of a good and convenient way to generate such UUIDs.
Explaining the details of how filesystems store file modification timestamps is out of scope for this document.
It suffices to remark that different filesystems may store this time data with varying precision, accuracy and reliability. Some filesystems may not provide suitable modification timestamps at all. In their default configuration, the most common non-distributed filesystems (NTFS, ext2, ext3, ext4, AFS, HFS+, etc.) store suitable modification timestamps, but it is advised to double check that this holds true if you use less common filesystems or configurations.
Another interesting remark is that PackSquash assumes that files with a modification time before their Squash Time were not changed in any way. This is usually a reasonable assumption, but it may not hold true if file modification timestamps are modified.
Squash Times are the count of quarter-seconds since Monday, 22 December 2014 0:00:00 (UTC), excluding leap seconds.
Mathematically, a Squash Time is defined as follows. Let be the POSIX time that is to be converted to a Squash Time, with millisecond precision (but not necessarily accuracy). Let be the POSIX time value for the Squash Time epoch, which is a constant whose value is . The Squash Time is:
Squash Times are always stored as unsigned 32-bit integers. So the highest date that can be represented is Wednesday, 30 December 2048 13:37:03 (UTC). The reasons for using Squash Times over second precision POSIX times are:
- They avert the year 2038 problem while having 4 times more precision, thanks to the choice of a much more recent epoch.
- Because they always are 4 bytes wide, they fit in the 4 bytes that the ZIP file records dedicate to the modification time and date fields, where they are stored after encryption.
Explaining the details of FPE is out of scope for this document. However, it is interesting to note that the CRC32 of a pack file is used as a tweak for the FF1 mode of operation of the AES-128 block cipher, so two files with different CRC32 will have different ciphertexts for the same plaintext.