Skip to content

About the BaRJ Cargo format

Esta Nagy edited this page Jun 19, 2024 · 15 revisions

TL;DR

Note

The BaRJ Cargo file format was designed during the implementation of the File BaRJ application to provide a bespoke solution for the use-cases of this backup and restore tool.

Requirements

The File BaRJ application needed an archive file format that could fulfill the following requirements:

  1. The archive file must be able to store directories, regular files and symbolic links
  2. The archive must support AES-256 GCM encryption of each entry in it, allowing usage of multiple encryption keys and randomized IV values per file.
  3. The archive must support a wide range of hashing algorithms to allow the user to select a desired algorithm for hash calculations.
    1. Hashes must be calculated for both the original and the archived content.
  4. The archive must support automatic splitting/chunking into multiple smaller files with configurable size limits to allow easy handling of manageable file chunks instead of a single large archive file.
  5. The archive (and the low level SDK reading/writing it) must be temper resistant
    1. Having built-in, automatic verification that all required archive chunks exist and have the correct sizes.
    2. Having built-in, automatic hash verification.
  6. The low level SDK reading and writing the archive must have measures in place to avoid path traversal attacks
    1. Only allow uniquely named entries
    2. Use normalized absolute paths for each entry inside the archive
    3. Disallow adding entries when their parent directory entries are not present in the archive
  7. The archive and the low level SDK creating it must support configurable compression algorithms to let the user decide whether they want to compress the data at all and in case they do, which algorithm should be used.
  8. The archive and the low level SDK creating it must have a way to add entries from streams without knowing how large the currently archived entry will be or what the data will look like.
    1. Temp files can be used if necessary for performance reasons, but have a way to archive entries without temp file creation as well to boost single thread archival efficiency.
  9. The archive format must allow merging multiple archives into a single archive if necessary to allow merging backups.
    1. Preferably do so without decryption, decompression, compression and re-encryption of the data to keep it as secure and as efficient as possible.

Solution overview

The reasons for not using readily available archival tools

As the well-established formats, like ZIP or tar.gz did not match our expectations, since none of them managed to meet all requirements, a new format was designed and implemented from the ground up. The biggest challenges were around configurability and the ability to archive entries from a streaming source at least in case of the previously existing formats. In fairness, the encryption and hash calculation functions could have been added if we did them before putting the entries into the archives, but this raised it's own issues. For example this would mean, that we won't know the exact size of the encrypted data (because of padding and the addition of the IV) and we would be forced to use temp files for each archive entry, making the encryption and compression a two step process. The question of ordering these two operations comes up in an advice [1] as well, concluding, that it is essentially a trade-off. Encrypting first means that we are encrypting the repetitive data (encrypting more data) and we compress the "random" cipher data later. This can reduce the speed of the operation as we are working more by encrypting more data end we can face poorer compression percentages as well, because the encryption can scramble the repetitive parts. The following sources can contribute more to the topic: [2], [3], [4], [5]. In conclusion, encrypting first is not a great fit for our use case, as we would like to reduce the archive size as much as possible and have some level of secrecy regarding the content.

Regarding the ability to use streaming data, it is very easy to see why ZIP or TAR cannot do this, because they are writing the metadata of the entry first, then the content later. In order to know sizes (before and after compression) and checksums, one would need to use temporary files when streaming data must be added to the archive. This was a challenge, as in case of File BaRJ, the single threaded implementation is intentionally not using any temp files and does everything only with streams.

Defining the file format

As a result, the BaRJ Cargo format is only storing the archived (individually compressed, then encrypted) data in the main archive files (plural as the large logical archive is split into smaller chunks on the file system). In order to know which file starts where and ensure their integrity we have introduced an additional archive file containing an archive index with the following data:

  • Entry path in the archive
  • Entry type (directory, regular file, symbolic link)
  • Encryption flag
  • Entry content boundary locators (missing in case of directories)
  • Entry metadata content boundary locators (can contain arbitrary information which we want to add about the original file, such as permissions, tags, etc.)

Each boundary locator contains the following:

  • Original size
  • Original hash
  • Archived size
  • Archived hash
  • Start pointer absolute from the start of the logical archive
  • End pointer absolute from the start of the logical archive
  • Chunk name where the entry's first byte is written
  • Chunk name where the entry's last byte is written
  • Start pointer relative to the chunk's start where the entry's first byte is written
  • End pointer relative to the chunk's start where the entry's last byte is written

The following diagram can illustrate how this works:

FileFormat

The archive contents and the index can be written in parallel. The properties file format was selected for the index format in order to further reduce the necessary memory footprint (since the properties file can be written sequentially with ease).

Writing an example

final var config = BarjCargoOutputStreamConfiguration.builder()
        .hashAlgorithm("SHA-256")
        .prefix("barj")
        .maxFileSizeMebibyte(1) //our minimal data won't reach the chunk limit
        .compressionFunction(IoFunction.IDENTITY_OUTPUT_STREAM) //no compression to see the content
        .folder(Path.of("/tmp/dir"))
        .build();
try (var stream = new BarjCargoArchiverFileOutputStream(config)) {
    stream.addDirectoryEntity("/dir", null,
            "arbitrary metadata of /dir");
    stream.addFileEntity("/dir/file1.ext", new ByteArrayInputStream("file1 content".getBytes()), null,
            "arbitrary metadata of /dir/file1.ext");
    stream.addSymbolicLinkEntity("/dir/file2.ext", "/dir/file1.txt", null,
            "arbitrary metadata of /dir/file2.ext");
    stream.addFileEntity("/dir/file3.ext", new ByteArrayInputStream("file3 content".getBytes()), null,
            "arbitrary metadata of /dir/file3.ext");
}

The result

barj.00001.cargo

arbitrary metadata of /dirfile1 contentarbitrary metadata of /dir/file1.ext/dir/file1.txtarbitrary metadata of /dir/file2.extfile3 contentarbitrary metadata of /dir/file3.ext

barj.index.cargo

# File BaRJ Cargo Archive Index
00000001.path:/dir
00000001.type:DIRECTORY
00000001.encrypt:false
00000001.metadata.rel.start.idx:0
00000001.metadata.rel.start.file:barj.00001.cargo
00000001.metadata.rel.end.idx:26
00000001.metadata.rel.end.file:barj.00001.cargo
00000001.metadata.abs.start.idx:0
00000001.metadata.abs.end.idx:26
00000001.metadata.orig.size:26
00000001.metadata.orig.hash:b5ba71cbf60c724f2ac86f1f818c60afae9d70fc29919614f05163c8748ff289
00000001.metadata.arch.size:26
00000001.metadata.arch.hash:b5ba71cbf60c724f2ac86f1f818c60afae9d70fc29919614f05163c8748ff289
00000002.path:/dir/file1.ext
00000002.type:REGULAR_FILE
00000002.encrypt:false
00000002.content.rel.start.idx:26
00000002.content.rel.start.file:barj.00001.cargo
00000002.content.rel.end.idx:39
00000002.content.rel.end.file:barj.00001.cargo
00000002.content.abs.start.idx:26
00000002.content.abs.end.idx:39
00000002.content.orig.size:13
00000002.content.orig.hash:1705789d380ee110bc09231df8af42a0cc564a1510ebd2168516d4985c40a263
00000002.content.arch.size:13
00000002.content.arch.hash:1705789d380ee110bc09231df8af42a0cc564a1510ebd2168516d4985c40a263
00000002.metadata.rel.start.idx:39
00000002.metadata.rel.start.file:barj.00001.cargo
00000002.metadata.rel.end.idx:75
00000002.metadata.rel.end.file:barj.00001.cargo
00000002.metadata.abs.start.idx:39
00000002.metadata.abs.end.idx:75
00000002.metadata.orig.size:36
00000002.metadata.orig.hash:8467fa2fb7ed6ac909285591309c882f7106ebde3c4d44d7342d11e303281810
00000002.metadata.arch.size:36
00000002.metadata.arch.hash:8467fa2fb7ed6ac909285591309c882f7106ebde3c4d44d7342d11e303281810
00000003.path:/dir/file2.ext
00000003.type:SYMBOLIC_LINK
00000003.encrypt:false
00000003.content.rel.start.idx:75
00000003.content.rel.start.file:barj.00001.cargo
00000003.content.rel.end.idx:89
00000003.content.rel.end.file:barj.00001.cargo
00000003.content.abs.start.idx:75
00000003.content.abs.end.idx:89
00000003.content.orig.size:14
00000003.content.orig.hash:8267a191a3ec453d93ea3db85d35f6f32ea74df7a192000a6df9ffb87990b36c
00000003.content.arch.size:14
00000003.content.arch.hash:8267a191a3ec453d93ea3db85d35f6f32ea74df7a192000a6df9ffb87990b36c
00000003.metadata.rel.start.idx:89
00000003.metadata.rel.start.file:barj.00001.cargo
00000003.metadata.rel.end.idx:125
00000003.metadata.rel.end.file:barj.00001.cargo
00000003.metadata.abs.start.idx:89
00000003.metadata.abs.end.idx:125
00000003.metadata.orig.size:36
00000003.metadata.orig.hash:9cd0d00d3de91c86b6dbcc9f4aa47c08127c55db6c93947f3be745a374f597dd
00000003.metadata.arch.size:36
00000003.metadata.arch.hash:9cd0d00d3de91c86b6dbcc9f4aa47c08127c55db6c93947f3be745a374f597dd
00000004.path:/dir/file3.ext
00000004.type:REGULAR_FILE
00000004.encrypt:false
00000004.content.rel.start.idx:125
00000004.content.rel.start.file:barj.00001.cargo
00000004.content.rel.end.idx:138
00000004.content.rel.end.file:barj.00001.cargo
00000004.content.abs.start.idx:125
00000004.content.abs.end.idx:138
00000004.content.orig.size:13
00000004.content.orig.hash:ea290f34cbd4a8143959aa8f89bb4a37b4de278d456b5e930320c50b30cea636
00000004.content.arch.size:13
00000004.content.arch.hash:ea290f34cbd4a8143959aa8f89bb4a37b4de278d456b5e930320c50b30cea636
00000004.metadata.rel.start.idx:138
00000004.metadata.rel.start.file:barj.00001.cargo
00000004.metadata.rel.end.idx:174
00000004.metadata.rel.end.file:barj.00001.cargo
00000004.metadata.abs.start.idx:138
00000004.metadata.abs.end.idx:174
00000004.metadata.orig.size:36
00000004.metadata.orig.hash:bc44ce95ea353244949dc3da703a68f0c49c3860f5652261033db22d194e6b55
00000004.metadata.arch.size:36
00000004.metadata.arch.hash:bc44ce95ea353244949dc3da703a68f0c49c3860f5652261033db22d194e6b55
last.chunk.index:1
last.chunk.size:174
max.chunk.size:1048576
last.entity.index:4
total.size:174
version:2

Reading all entries of the example sequentially

final var config = BarjCargoInputStreamConfiguration.builder()
        .hashAlgorithm("SHA-256")
        .prefix("barj")
        .compressionFunction(IoFunction.IDENTITY_INPUT_STREAM)
        .folder(Path.of("/tmp/dir"))
        .build();

final var source = new BarjCargoArchiveFileInputStreamSource(config);
final var iterator = source.getIterator();

Assertions.assertTrue(iterator.hasNext());
//verify /dir
final var dir = iterator.next();
Assertions.assertEquals(FileType.DIRECTORY, dir.getFileType());
Assertions.assertEquals("/dir", dir.getPath());
Assertions.assertEquals("arbitrary metadata of /dir", dir.getMetadata(null));
//verify /dir/file1.ext
Assertions.assertTrue(iterator.hasNext());
final var file1 = iterator.next();
Assertions.assertEquals(FileType.REGULAR_FILE, file1.getFileType());
Assertions.assertEquals("/dir/file1.ext", file1.getPath());
//the order is important, read content first!
Assertions.assertArrayEquals("file1 content".getBytes(), file1.getFileContent(null).readAllBytes());
Assertions.assertEquals("arbitrary metadata of /dir/file1.ext", file1.getMetadata(null));
//verify /dir/file2.ext
Assertions.assertTrue(iterator.hasNext());
final var file2 = iterator.next();
Assertions.assertEquals(FileType.SYMBOLIC_LINK, file2.getFileType());
Assertions.assertEquals("/dir/file2.ext", file2.getPath());
Assertions.assertEquals("/dir/file1.txt", file2.getLinkTarget(null));
Assertions.assertEquals("arbitrary metadata of /dir/file2.ext", file2.getMetadata(null));
//verify /dir/file3.ext
Assertions.assertTrue(iterator.hasNext());
final var file3 = iterator.next();
Assertions.assertEquals(FileType.REGULAR_FILE, file3.getFileType());
Assertions.assertEquals("/dir/file3.ext", file3.getPath());
Assertions.assertArrayEquals("file3 content".getBytes(), file3.getFileContent(null).readAllBytes());
Assertions.assertEquals("arbitrary metadata of /dir/file3.ext", file3.getMetadata(null));

Assertions.assertFalse(iterator.hasNext());

Security

In order to reduce the amount of information leaked and the risk with it, we have applied the following measures.

Multiple keys

The low level API supports the use of any number of AES secret keys for the entries. It is even possible to use different keys for each entry if necessary, but such an extreme would potentially put a huge burden on the caller as the secret keys are not managed by the BaRJ archiver SDK. This can help us with limiting the amount of data encrypted per key, which is an important consideration for AES GCM encryption [6].

Random IV

The archival process makes sure to generate secure random IV values for each entry to reduce the risk of reusing the same key and the same IV for two or more entries. The IV values are written together with the encrypted data.

Encrypting the index

To avoid leaking the names and locations of the archive entries, we can provide an AES secret key to encrypt the index file as well. This process will use the same, compress then encrypt approach as mentioned earlier in case of the archive entry content. Since the index is encrypted as well, an attacker can waste more time on figuring out how many entries we may have, where do they start and end and what are the IV values used. Each of these can help us protect our valuable data a bit longer.

Addressing the trade-off

As previously mentioned, the BaRJ Cargo entries are encrypted after the compression step. It is worth noting though, that due to the configurable nature of the archival, the user can easily select no compression and avoid the whole controversy around the order of these operations. It can make sense if we assume that the cypher will not be compressible.

Performance

The BaRJ Cargo format supports the following features which might help you achieve better performance during archival or accessing archived data.

Single threaded or parallel writes

The BaRJ Cargo format's low level SDK provides two implementations for writes:

  • The single threaded write is not using any temp files, performing archival in a single step.
  • The parallel implementation uses an executor for performing the compression and encryption steps parallel, then a separate thread can merge the generated temp files into the archive sequentially

Reading a single entry

When reading a single entry, we don't need to read the full archive. This is thanks to the index details which we stored. The performance gain can be even more significant if we have more chunks, as the implementation can make sure, that we are not touching those chunks which are irrelevant for our entry in scope.

Reading multiple entries sequentially

The implementation, which is recommended for multiple entries, uses an iterator, that can sequentially loop through the selected entries. Of course it tries to limit reads to the relevant files in these cases as well by starting at the first relevant chunk and ending right after the last entry. This restore process has a multi-threaded implementation that can open the archive more than one times and split the workload between the threads.

Tip

Based on our internal testing, using 2-3 threads is the sweet spot (both for read and write use-cases), balancing performance and resource use. It is worth noting, that it does not necessarily mean 2x-3x speed, because edge cases, like having one very large file and many small files can cause slow-downs. Feel free to experiment with different options and determine which is best for your data and your needs.

Configurable, optional operations

The BaRJ Cargo format is configurable, every computation intensive operation can be turned off. As a result, users can make the trade-offs they want between:

  • File size and compression speed (using no compression at all in an edge case)
  • Security and archival speed (turning off encryption if it is not required)
  • Integrity guarantees and speed (using faster, but less reliable hash algorithms, or no hashing at all if it is not important)