
Merge pull request #63 from uktrade/feat/streamable-uncompressed
feat: stream-writable uncompressed member files with known size and crc32
michalc authored Jul 16, 2023
2 parents 3f331c7 + ec92583 commit 6c6db48
Showing 4 changed files with 257 additions and 23 deletions.
12 changes: 12 additions & 0 deletions docs/exceptions.md
@@ -12,6 +12,18 @@ Exceptions raised by the source iterables are passed through the `stream_zip` function

Base class for errors relating to invalid arguments

- **ZipIntegrityError**

An integrity check failed (illustrated in the sketch after this list)

- **CRC32IntegrityError**

The CRC32 calculated from data did not match the CRC32 passed into the method

- **UncompressedSizeIntegrityError**

The uncompressed size of data did not match the uncompressed size passed into the method

- **ZipOverflowError** (also inherits from the **OverflowError** built-in)

The size or positions of data in the ZIP are too large to store using the requested method
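To make the two integrity exceptions above concrete, here is a minimal sketch (not part of this change) that supplies a deliberately wrong CRC 32. It assumes the `(name, modified_at, mode, method, chunks)` member tuples accepted by `stream_zip`; the file name and mode are illustrative only.

```python
from datetime import datetime
from stat import S_IFREG
import zlib

from stream_zip import stream_zip, NO_COMPRESSION_32, CRC32IntegrityError

data = b'contents'
wrong_crc_32 = (zlib.crc32(data) + 1) & 0xffffffff  # deliberately does not match data

member_files = (
    ('file.txt', datetime.now(), S_IFREG | 0o600, NO_COMPRESSION_32(len(data), wrong_crc_32), (data,)),
)

try:
    for zipped_chunk in stream_zip(member_files):
        pass  # the member's bytes are yielded before the mismatch is detected
except CRC32IntegrityError:
    print('The CRC 32 passed to NO_COMPRESSION_32 did not match the data')
```

Note that the check can only run once the member's data has been fully streamed, so some ZIP output will already have been produced by the time the exception is raised.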
14 changes: 9 additions & 5 deletions docs/methods.md
@@ -3,26 +3,30 @@
Each member file of the ZIP is compressed with one of the following methods.


## ZIP_32, NO_COMPRESSION_32
## ZIP_32, NO_COMPRESSION_32, NO_COMPRESSION_32(uncompressed_size, crc_32)

These methods are the historical standard methods for ZIP files.

`ZIP_32` compresses the file by default, but it is affected by the `get_compressobj` parameter to `stream_zip`. For example, by passing `get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)`, the `level=0` part would result in this file not being compressed. Its size would still increase slightly due to the overhead of the underlying algorithm.

`NO_COMPRESSION_32` does not compress the member file, and is not affected by the `get_compressobj` parameter to `stream_zip`. However, its entire contents are buffered in memory before output begins, and so it should not be used for large files. Its size does not increase - in the final ZIP file the contents of each `NO_COMPRESSION_32` member file are present byte-for-byte.
Both `NO_COMPRESSION_32` and `NO_COMPRESSION_32(uncompressed_size, crc_32)` store the contents of the file in the ZIP uncompressed exactly as supplied, and are not affected by the `get_compressobj` parameter to `stream_zip`.

Each member file is limited to 4GiB (gibibyte). This limitation is on the uncompressed size of the data, and (if `ZIP_32`) the compressed size of the data, and how far the start of the member file is from the beginning of the final ZIP file. A `ZIP_32` or `NO_COMPRESSION_32` file can also not be later than the 65,535th member file in a ZIP. If a file only has `ZIP_32` or `NO_COMPRESSION_32` members, the entire file is a Zip32 file, and the end of the final member file must be less than 4GiB from the beginning of the final ZIP. If these limits are breached, a `ZipOverflowError` will be raised.
For `NO_COMPRESSION_32` the entire contents are buffered in memory before output begins, and so it should not be used for large files. For `NO_COMPRESSION_32(uncompressed_size, crc_32)` the contents are streamed, but at the price of having to determine the uncompressed size and CRC 32 of the contents beforehand. These limitations, although awkward when writing the ZIP, allow the ZIP file to be read in a streaming way.

Each member file using one of these methods is limited to 4GiB (gibibyte). This limitation is on the uncompressed size of the data, and (if `ZIP_32`) the compressed size of the data, and how far the start of the member file is from the beginning of the final ZIP file. Also, each member file cannot be later than the 65,535th member file in a ZIP. If a file has only these members, the entire file is a Zip32 file, and the end of the final member file must be less than 4GiB from the beginning of the final ZIP. If these limits are breached, a `ZipOverflowError` will be raised.

These methods have very high support. You can usually assume anything that can open a ZIP file can open ZIP files with only `ZIP_32` or `NO_COMPRESSION_32` members.
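To illustrate the difference between the buffered and streamed no-compression variants described above, here is a sketch (not taken from this change, and assuming the `(name, modified_at, mode, method, chunks)` member tuples accepted by `stream_zip` and a typical regular-file mode). `NO_COMPRESSION_32(uncompressed_size, crc_32)` lets the contents be streamed because the size and CRC 32 are supplied up front:

```python
from datetime import datetime
from stat import S_IFREG
import zlib

from stream_zip import stream_zip, NO_COMPRESSION_32

data = b'contents whose size and CRC 32 are known in advance'
crc_32 = zlib.crc32(data)

def contents():
    # May yield any number of chunks, as long as they total exactly len(data)
    # bytes and match crc_32 - otherwise an integrity error is raised
    yield data

member_files = (
    ('my-file.txt', datetime.now(), S_IFREG | 0o600, NO_COMPRESSION_32(len(data), crc_32), contents()),
)

with open('streamed.zip', 'wb') as f:
    for zipped_chunk in stream_zip(member_files):
        f.write(zipped_chunk)
```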


## ZIP_64, NO_COMPRESSION_64
## ZIP_64, NO_COMPRESSION_64, NO_COMPRESSION_64(uncompressed_size, crc_32)

These methods use the Zip64 extension to the original ZIP format.

`ZIP_64` compresses the file by default, but it is affected by the `get_compressobj` parameter to `stream_zip`. For example, by passing `get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)`, the `level=0` part would result in this file not being compressed. However, its size would increase slightly due to the overhead of the underlying algorithm.

`NO_COMPRESSION_64` does not compress the member file, and is not affected by the `get_compressobj` parameter to `stream_zip`. However, its entire contents are buffered in memory before output begins, and so it should not be used for large files. Its size does not increase - in the final ZIP file the contents of each `NO_COMPRESSION_64` member file are present byte-for-byte.
Both `NO_COMPRESSION_64` and `NO_COMPRESSION_64(uncompressed_size, crc_32)` store the contents of the file in the ZIP uncompressed exactly as supplied, and are not affected by the `get_compressobj` parameter to `stream_zip`.

For `NO_COMPRESSION_64` the entire contents are buffered in memory before output begins, and so it should not be used for large files. For `NO_COMPRESSION_64(uncompressed_size, crc_32)` the contents are streamed, but at the price of having to determine the uncompressed size and CRC 32 of the contents beforehand. These limitations, although awkward when writing the ZIP, allow the ZIP file to be read in a streaming way.

Each member file is limited to 16EiB (exbibyte). This limitation is on the uncompressed size of the data, and (if `ZIP_64`) the compressed size of the data, and how far the start of the member file is from the beginning of the final ZIP file. If these limits are breached, a `ZipOverflowError` will be raised.
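For example (again a sketch under the same assumptions as the example above, with a hypothetical input file), a large file on disk can be added uncompressed without buffering it in memory: one pass computes its CRC 32, and a second pass streams its contents.

```python
from datetime import datetime
from stat import S_IFREG
import os
import zlib

from stream_zip import stream_zip, NO_COMPRESSION_64

path = 'very-large-file.bin'  # hypothetical input file

# First pass: determine the CRC 32; the size comes from the filesystem
crc_32 = zlib.crc32(b'')
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(65536), b''):
        crc_32 = zlib.crc32(chunk, crc_32)
uncompressed_size = os.path.getsize(path)

# Second pass: stream the contents into the ZIP without buffering them
def contents():
    with open(path, 'rb') as f:
        yield from iter(lambda: f.read(65536), b'')

member_files = (
    (os.path.basename(path), datetime.now(), S_IFREG | 0o600, NO_COMPRESSION_64(uncompressed_size, crc_32), contents()),
)

with open('streamed-64.zip', 'wb') as f:
    for zipped_chunk in stream_zip(member_files):
        f.write(zipped_chunk)
```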

190 changes: 172 additions & 18 deletions stream_zip.py
@@ -2,25 +2,47 @@
from struct import Struct
import zlib

_NO_COMPRESSION_32 = object()
_NO_COMPRESSION_64 = object()
# Private methods

_NO_COMPRESSION_BUFFERED_32 = object()
_NO_COMPRESSION_BUFFERED_64 = object()
_NO_COMPRESSION_STREAMED_32 = object()
_NO_COMPRESSION_STREAMED_64 = object()
_ZIP_32 = object()
_ZIP_64 = object()

_AUTO_UPGRADE_CENTRAL_DIRECTORY = object()
_NO_AUTO_UPGRADE_CENTRAL_DIRECTORY = object()

def NO_COMPRESSION_32(offset, default_get_compressobj):
    return _NO_COMPRESSION_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj
def __NO_COMPRESSION_BUFFERED_32(offset, default_get_compressobj):
    return _NO_COMPRESSION_BUFFERED_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

def __NO_COMPRESSION_BUFFERED_64(offset, default_get_compressobj):
    return _NO_COMPRESSION_BUFFERED_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

def NO_COMPRESSION_64(offset, default_get_compressobj):
    return _NO_COMPRESSION_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj
def __NO_COMPRESSION_STREAMED_32(uncompressed_size, crc_32):
    def method_compressobj(offset, default_get_compressobj):
        return _NO_COMPRESSION_STREAMED_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, uncompressed_size, crc_32
    return method_compressobj

def __NO_COMPRESSION_STREAMED_64(uncompressed_size, crc_32):
    def method_compressobj(offset, default_get_compressobj):
        return _NO_COMPRESSION_STREAMED_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, uncompressed_size, crc_32
    return method_compressobj

# Public methods

def NO_COMPRESSION_32(uncompressed_size, crc_32):
    return __NO_COMPRESSION_STREAMED_32(uncompressed_size, crc_32)

def NO_COMPRESSION_64(uncompressed_size, crc_32):
    return __NO_COMPRESSION_STREAMED_64(uncompressed_size, crc_32)

def ZIP_32(offset, default_get_compressobj):
    return _ZIP_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj
    return _ZIP_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

def ZIP_64(offset, default_get_compressobj):
    return _ZIP_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj
    return _ZIP_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

def ZIP_AUTO(uncompressed_size, level=9):
    def method_compressobj(offset, default_get_compressobj):
@@ -37,7 +59,7 @@ def method_compressobj(offset, default_get_compressobj):
        # so Python could be causing extra deflate-chunks output which could break the limit. However, couldn't
        # get output of sized 4293656841 to break the Zip32 bound of 0xffffffff here for any level, including 0
        method = _ZIP_64 if uncompressed_size > 4293656841 or offset > 0xffffffff else _ZIP_32
        return (method, _AUTO_UPGRADE_CENTRAL_DIRECTORY, lambda: zlib.compressobj(level=level, memLevel=8, wbits=-zlib.MAX_WBITS))
        return (method, _AUTO_UPGRADE_CENTRAL_DIRECTORY, lambda: zlib.compressobj(level=level, memLevel=8, wbits=-zlib.MAX_WBITS), None, None)
    return method_compressobj


@@ -114,7 +136,7 @@ def _raise_if_beyond(offset, maximum, exception_class):
    if offset > maximum:
        raise exception_class()

def _zip_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, _get_compress_obj, chunks):
def _zip_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffffffffffff, exception_class=OffsetOverflowError)
@@ -177,7 +199,7 @@ def _zip_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra
        0xffffffff, # Offset of local header - since zip64
    ), name_encoded, extra

def _zip_32_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, _get_compress_obj, chunks):
def _zip_32_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffff, exception_class=OffsetOverflowError)
@@ -255,7 +277,7 @@ def _zip_data(chunks, _get_compress_obj, max_uncompressed_size, max_compressed_s

    return uncompressed_size, compressed_size, crc_32

def _no_compression_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, _get_compress_obj, chunks):
def _no_compression_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffffffffffff, exception_class=OffsetOverflowError)
@@ -314,7 +336,7 @@ def _no_compression_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at
    ), name_encoded, extra


def _no_compression_32_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, _get_compress_obj, chunks):
def _no_compression_32_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffff, exception_class=OffsetOverflowError)
@@ -379,8 +401,126 @@ def _chunks():

    return chunks, size, crc_32

def _no_compression_streamed_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffffffffffff, exception_class=OffsetOverflowError)

    extra = zip_64_local_extra_struct.pack(
        zip_64_extra_signature,
        16, # Size of extra
        uncompressed_size, # Uncompressed
        uncompressed_size, # Compressed
    ) + mod_at_unix_extra
    yield from _(local_header_signature)
    yield from _(local_header_struct.pack(
        45, # Version
        b'\x00\x08', # Flags - utf-8 file names
        0, # Compression - no compression
        mod_at_ms_dos,
        crc_32,
        0xffffffff, # Compressed size - since zip64
        0xffffffff, # Uncompressed size - since zip64
        len(name_encoded),
        len(extra),
    ))
    yield from _(name_encoded)
    yield from _(extra)

    yield from _no_compression_streamed_data(chunks, uncompressed_size, crc_32, 0xffffffffffffffff)

    extra = zip_64_central_directory_extra_struct.pack(
        zip_64_extra_signature,
        24, # Size of extra
        uncompressed_size, # Uncompressed
        uncompressed_size, # Compressed
        file_offset,
    ) + mod_at_unix_extra
    return central_directory_header_struct.pack(
        45, # Version made by
        3, # System made by (UNIX)
        45, # Version required
        0, # Reserved
        b'\x00\x08', # Flags - utf-8 file names
        0, # Compression - none
        mod_at_ms_dos,
        crc_32,
        0xffffffff, # Compressed size - since zip64
        0xffffffff, # Uncompressed size - since zip64
        len(name_encoded),
        len(extra),
        0, # File comment length
        0, # Disk number
        0, # Internal file attributes - is binary
        external_attr,
        0xffffffff, # File offset - since zip64
    ), name_encoded, extra


def _no_compression_streamed_32_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffff, exception_class=OffsetOverflowError)

    extra = mod_at_unix_extra
    yield from _(local_header_signature)
    yield from _(local_header_struct.pack(
        20, # Version
        b'\x00\x08', # Flags - utf-8 file names
        0, # Compression - no compression
        mod_at_ms_dos,
        crc_32,
        uncompressed_size, # Compressed
        uncompressed_size, # Uncompressed
        len(name_encoded),
        len(extra),
    ))
    yield from _(name_encoded)
    yield from _(extra)

    yield from _no_compression_streamed_data(chunks, uncompressed_size, crc_32, 0xffffffff)

    return central_directory_header_struct.pack(
        20, # Version made by
        3, # System made by (UNIX)
        20, # Version required
        0, # Reserved
        b'\x00\x08', # Flags - utf-8 file names
        0, # Compression - none
        mod_at_ms_dos,
        crc_32,
        uncompressed_size, # Compressed
        uncompressed_size, # Uncompressed
        len(name_encoded),
        len(extra),
        0, # File comment length
        0, # Disk number
        0, # Internal file attributes - is binary
        external_attr,
        file_offset,
    ), name_encoded, extra

def _no_compression_streamed_data(chunks, uncompressed_size, crc_32, maximum_size):
    actual_crc_32 = zlib.crc32(b'')
    size = 0
    for chunk in chunks:
        actual_crc_32 = zlib.crc32(chunk, actual_crc_32)
        size += len(chunk)
        _raise_if_beyond(size, maximum=maximum_size, exception_class=UncompressedSizeOverflowError)
        yield from _(chunk)

    if actual_crc_32 != crc_32:
        raise CRC32IntegrityError()

    if size != uncompressed_size:
        raise UncompressedSizeIntegrityError()

for name, modified_at, mode, method, chunks in files:
    _method, _auto_upgrade_central_directory, _get_compress_obj = method(offset, get_compressobj)
    method = \
        __NO_COMPRESSION_BUFFERED_32 if method is NO_COMPRESSION_32 else \
        __NO_COMPRESSION_BUFFERED_64 if method is NO_COMPRESSION_64 else \
        method
    _method, _auto_upgrade_central_directory, _get_compress_obj, uncompressed_size, crc_32 = method(offset, get_compressobj)

    name_encoded = name.encode('utf-8')
    _raise_if_beyond(len(name_encoded), maximum=0xffff, exception_class=NameLengthOverflowError)
@@ -406,17 +546,19 @@ def _chunks():
    data_func = \
        _zip_64_local_header_and_data if _method is _ZIP_64 else \
        _zip_32_local_header_and_data if _method is _ZIP_32 else \
        _no_compression_64_local_header_and_data if _method is _NO_COMPRESSION_64 else \
        _no_compression_32_local_header_and_data
        _no_compression_64_local_header_and_data if _method is _NO_COMPRESSION_BUFFERED_64 else \
        _no_compression_32_local_header_and_data if _method is _NO_COMPRESSION_BUFFERED_32 else \
        _no_compression_streamed_64_local_header_and_data if _method is _NO_COMPRESSION_STREAMED_64 else \
        _no_compression_streamed_32_local_header_and_data

    central_directory_header_entry, name_encoded, extra = yield from data_func(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, _get_compress_obj, evenly_sized(chunks))
    central_directory_header_entry, name_encoded, extra = yield from data_func(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, evenly_sized(chunks))
    central_directory_size += len(central_directory_header_signature) + len(central_directory_header_entry) + len(name_encoded) + len(extra)
    central_directory.append((central_directory_header_entry, name_encoded, extra))

    zip_64_central_directory = zip_64_central_directory \
        or (_auto_upgrade_central_directory is _AUTO_UPGRADE_CENTRAL_DIRECTORY and offset > 0xffffffff) \
        or (_auto_upgrade_central_directory is _AUTO_UPGRADE_CENTRAL_DIRECTORY and len(central_directory) > 0xffff) \
        or _method in (_ZIP_64, _NO_COMPRESSION_64)
        or _method in (_ZIP_64, _NO_COMPRESSION_BUFFERED_64, _NO_COMPRESSION_STREAMED_64)

    max_central_directory_length, max_central_directory_start_offset, max_central_directory_size = \
        (0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff) if zip_64_central_directory else \
@@ -491,6 +633,18 @@ class ZipValueError(ZipError, ValueError):
    pass


class ZipIntegrityError(ZipValueError):
    pass


class CRC32IntegrityError(ZipIntegrityError):
    pass


class UncompressedSizeIntegrityError(ZipIntegrityError):
    pass


class ZipOverflowError(ZipValueError, OverflowError):
    pass

