feat: dynamic mode to stream write uncompressed files
michalc committed Jul 16, 2023
1 parent 3e0d7d0 commit ec92583
Showing 4 changed files with 243 additions and 9 deletions.
12 changes: 12 additions & 0 deletions docs/exceptions.md
@@ -12,6 +12,18 @@ Exceptions raised by the source iterables are passed through the `stream_zip` function

Base class for errors relating to invalid arguments

- **ZipIntegrityError**

An integrity check failed

- **CRC32IntegrityError**

The CRC32 calculated from data did not match the CRC32 passed into the method

- **UncompressedSizeIntegrityError**

The uncompressed size of data did not match the uncompressed size passed into the method

- **ZipOverflowError** (also inherits from the **OverflowError** built-in)

The size or positions of data in the ZIP are too large to store using the requested method
14 changes: 9 additions & 5 deletions docs/methods.md
@@ -3,26 +3,30 @@
Each member file of the ZIP is compressed with one of the following methods.


## ZIP_32, NO_COMPRESSION_32
## ZIP_32, NO_COMPRESSION_32, NO_COMPRESSION_32(uncompressed_size, crc_32)

These methods are the historical standard methods for ZIP files.

`ZIP_32` compresses the file by default, but it is affected by the `get_compressobj` parameter to `stream_zip`. For example, by passing `get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)`, the `level=0` part would result in this file not being compressed. Its size would increase slightly due to the overhead of the underlying algorithm.

`NO_COMPRESSION_32` does not compress the member file, and is not affected by the `get_compressobj` parameter to `stream_zip`. However, its entire contents are buffered in memory before output begins, and so it should not be used for large files. Its size does not increase - in the final ZIP file the contents of each `NO_COMPRESSION_32` member file are present byte-for-byte.
Both `NO_COMPRESSION_32` and `NO_COMPRESSION_32(uncompressed_size, crc_32)` store the contents of the file in the ZIP uncompressed exactly as supplied, and are not affected by the `get_compressobj` parameter to `stream_zip`.

Each member file is limited to 4GiB (gibibyte). This limitation is on the uncompressed size of the data, (if `ZIP_32`) the compressed size of the data, and how far the start of the member file is from the beginning of the final ZIP file. A `ZIP_32` or `NO_COMPRESSION_32` file also cannot be later than the 65,535th member file in a ZIP. If a file only has `ZIP_32` or `NO_COMPRESSION_32` members, the entire file is a Zip32 file, and the end of the final member file must be less than 4GiB from the beginning of the final ZIP. If these limits are breached, a `ZipOverflowError` will be raised.
For `NO_COMPRESSION_32` the entire contents are buffered in memory before output begins, and so should not be used for large files. For `NO_COMPRESSION_32(uncompressed_size, crc_32)` the contents are streamed, but at the price of having to determine the uncompressed size and CRC 32 of the contents beforehand. These limitations, although awkward when writing the ZIP, allow the ZIP file to be read in a streaming way.

Each member file using one of these methods is limited to 4GiB (gibibyte). This limitation is on the uncompressed size of the data, (if `ZIP_32`) the compressed size of the data, and how far the start of the member file is from the beginning of the final ZIP file. Also, each member file cannot be later than the 65,535th member file in a ZIP. If a file has only these members, the entire file is a Zip32 file, and the end of the final member file must be less than 4GiB from the beginning of the final ZIP. If these limits are breached, a `ZipOverflowError` will be raised.

This has very high support. You can usually assume anything that can open a ZIP file can open ZIP files with only `ZIP_32` or `NO_COMPRESSION_32` members.
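
As a sketch, if the size and CRC 32 of the contents are known up front, `NO_COMPRESSION_32(uncompressed_size, crc_32)` could be used along these lines, passing the usual `(name, modified_at, mode, method, chunks)` member tuples to `stream_zip` (the file name and output path below are illustrative):

```python
import stat
import zlib
from datetime import datetime
from stream_zip import stream_zip, NO_COMPRESSION_32

contents = b'a' * 10000  # Size and CRC 32 must be known before output begins

def member_files():
    # The uncompressed size and CRC 32 are declared up front, so nothing is buffered
    yield 'my-file', datetime.now(), stat.S_IFREG | 0o600, \
        NO_COMPRESSION_32(len(contents), zlib.crc32(contents)), (contents,)

with open('my.zip', 'wb') as f:
    for chunk in stream_zip(member_files()):
        f.write(chunk)
```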


## ZIP_64, NO_COMPRESSION_64
## ZIP_64, NO_COMPRESSION_64, NO_COMPRESSION_64(uncompressed_size, crc_32)

These methods use the Zip64 extension to the original ZIP format.

`ZIP_64` compresses the file by default, but it is affected by the `get_compressobj` parameter to `stream_zip`. For example, by passing `get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)`, the `level=0` part would result in this file not being compressed. However, its size would increase slightly due to the overhead of the underlying algorithm.

`NO_COMPRESSION_64` does not compress the member file, and is not affected by the `get_compressobj` parameter to `stream_zip`. However, its entire contents are buffered in memory before output begins, and so it should not be used for large files. Its size does not increase - in the final ZIP file the contents of each `NO_COMPRESSION_64` member file are present byte-for-byte.
Both `NO_COMPRESSION_64` and `NO_COMPRESSION_64(uncompressed_size, crc_32)` store the contents of the file in the ZIP uncompressed exactly as supplied, and are not affected by the `get_compressobj` parameter to `stream_zip`.

For `NO_COMPRESSION_64` the entire contents are buffered in memory before output begins, and so should not be used for large files. For `NO_COMPRESSION_64(uncompressed_size, crc_32)` the contents are streamed, but at the price of having to determine the uncompressed size and CRC 32 of the contents beforehand. These limitations, although awkward when writing the ZIP, allow the ZIP file to be read in a streaming way.

Each member file is limited to 16EiB (exbibyte). This limitation is on the uncompressed size of the data, (if `ZIP_64`) the compressed size of the data, and how far the member starts from the beginning of the final ZIP file. If these limits are breached, a `ZipOverflowError` will be raised.
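
If the uncompressed size or CRC 32 passed to `NO_COMPRESSION_64(uncompressed_size, crc_32)` (or its Zip32 equivalent) turns out not to match the data actually supplied, an integrity error is raised. A sketch of the failure mode, using the exceptions introduced by this change (the member name below is illustrative):

```python
import stat
import zlib
from datetime import datetime
from stream_zip import stream_zip, NO_COMPRESSION_64, CRC32IntegrityError

def member_files():
    # Declared CRC 32 is for b'', but the data supplied is b'a'
    yield 'bad-file', datetime.now(), stat.S_IFREG | 0o600, \
        NO_COMPRESSION_64(1, zlib.crc32(b'')), (b'a',)

try:
    for chunk in stream_zip(member_files()):
        pass
except CRC32IntegrityError:
    print('Declared CRC 32 did not match the data supplied')
```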

162 changes: 158 additions & 4 deletions stream_zip.py
@@ -2,20 +2,42 @@
from struct import Struct
import zlib

# Private methods

_NO_COMPRESSION_BUFFERED_32 = object()
_NO_COMPRESSION_BUFFERED_64 = object()
_NO_COMPRESSION_STREAMED_32 = object()
_NO_COMPRESSION_STREAMED_64 = object()
_ZIP_32 = object()
_ZIP_64 = object()

_AUTO_UPGRADE_CENTRAL_DIRECTORY = object()
_NO_AUTO_UPGRADE_CENTRAL_DIRECTORY = object()

def NO_COMPRESSION_32(offset, default_get_compressobj):
def __NO_COMPRESSION_BUFFERED_32(offset, default_get_compressobj):
    return _NO_COMPRESSION_BUFFERED_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

def NO_COMPRESSION_64(offset, default_get_compressobj):
def __NO_COMPRESSION_BUFFERED_64(offset, default_get_compressobj):
    return _NO_COMPRESSION_BUFFERED_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

def __NO_COMPRESSION_STREAMED_32(uncompressed_size, crc_32):
    def method_compressobj(offset, default_get_compressobj):
        return _NO_COMPRESSION_STREAMED_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, uncompressed_size, crc_32
    return method_compressobj

def __NO_COMPRESSION_STREAMED_64(uncompressed_size, crc_32):
    def method_compressobj(offset, default_get_compressobj):
        return _NO_COMPRESSION_STREAMED_64, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, uncompressed_size, crc_32
    return method_compressobj

# Public methods

def NO_COMPRESSION_32(uncompressed_size, crc_32):
    return __NO_COMPRESSION_STREAMED_32(uncompressed_size, crc_32)

def NO_COMPRESSION_64(uncompressed_size, crc_32):
    return __NO_COMPRESSION_STREAMED_64(uncompressed_size, crc_32)

def ZIP_32(offset, default_get_compressobj):
    return _ZIP_32, _NO_AUTO_UPGRADE_CENTRAL_DIRECTORY, default_get_compressobj, None, None

@@ -379,7 +401,125 @@ def _chunks():

    return chunks, size, crc_32

def _no_compression_streamed_64_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffffffffffff, exception_class=OffsetOverflowError)

    extra = zip_64_local_extra_struct.pack(
        zip_64_extra_signature,
        16,  # Size of extra
        uncompressed_size,  # Uncompressed
        uncompressed_size,  # Compressed
    ) + mod_at_unix_extra
    yield from _(local_header_signature)
    yield from _(local_header_struct.pack(
        45,  # Version
        b'\x00\x08',  # Flags - utf-8 file names
        0,  # Compression - no compression
        mod_at_ms_dos,
        crc_32,
        0xffffffff,  # Compressed size - since zip64
        0xffffffff,  # Uncompressed size - since zip64
        len(name_encoded),
        len(extra),
    ))
    yield from _(name_encoded)
    yield from _(extra)

    yield from _no_compression_streamed_data(chunks, uncompressed_size, crc_32, 0xffffffffffffffff)

    extra = zip_64_central_directory_extra_struct.pack(
        zip_64_extra_signature,
        24,  # Size of extra
        uncompressed_size,  # Uncompressed
        uncompressed_size,  # Compressed
        file_offset,
    ) + mod_at_unix_extra
    return central_directory_header_struct.pack(
        45,  # Version made by
        3,  # System made by (UNIX)
        45,  # Version required
        0,  # Reserved
        b'\x00\x08',  # Flags - utf-8 file names
        0,  # Compression - none
        mod_at_ms_dos,
        crc_32,
        0xffffffff,  # Compressed size - since zip64
        0xffffffff,  # Uncompressed size - since zip64
        len(name_encoded),
        len(extra),
        0,  # File comment length
        0,  # Disk number
        0,  # Internal file attributes - is binary
        external_attr,
        0xffffffff,  # File offset - since zip64
    ), name_encoded, extra


def _no_compression_streamed_32_local_header_and_data(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, chunks):
    file_offset = offset

    _raise_if_beyond(file_offset, maximum=0xffffffff, exception_class=OffsetOverflowError)

    extra = mod_at_unix_extra
    yield from _(local_header_signature)
    yield from _(local_header_struct.pack(
        20,  # Version
        b'\x00\x08',  # Flags - utf-8 file names
        0,  # Compression - no compression
        mod_at_ms_dos,
        crc_32,
        uncompressed_size,  # Compressed
        uncompressed_size,  # Uncompressed
        len(name_encoded),
        len(extra),
    ))
    yield from _(name_encoded)
    yield from _(extra)

    yield from _no_compression_streamed_data(chunks, uncompressed_size, crc_32, 0xffffffff)

    return central_directory_header_struct.pack(
        20,  # Version made by
        3,  # System made by (UNIX)
        20,  # Version required
        0,  # Reserved
        b'\x00\x08',  # Flags - utf-8 file names
        0,  # Compression - none
        mod_at_ms_dos,
        crc_32,
        uncompressed_size,  # Compressed
        uncompressed_size,  # Uncompressed
        len(name_encoded),
        len(extra),
        0,  # File comment length
        0,  # Disk number
        0,  # Internal file attributes - is binary
        external_attr,
        file_offset,
    ), name_encoded, extra

def _no_compression_streamed_data(chunks, uncompressed_size, crc_32, maximum_size):
    actual_crc_32 = zlib.crc32(b'')
    size = 0
    for chunk in chunks:
        actual_crc_32 = zlib.crc32(chunk, actual_crc_32)
        size += len(chunk)
        _raise_if_beyond(size, maximum=maximum_size, exception_class=UncompressedSizeOverflowError)
        yield from _(chunk)

    if actual_crc_32 != crc_32:
        raise CRC32IntegrityError()

    if size != uncompressed_size:
        raise UncompressedSizeIntegrityError()

for name, modified_at, mode, method, chunks in files:
    method = \
        __NO_COMPRESSION_BUFFERED_32 if method is NO_COMPRESSION_32 else \
        __NO_COMPRESSION_BUFFERED_64 if method is NO_COMPRESSION_64 else \
        method
    _method, _auto_upgrade_central_directory, _get_compress_obj, uncompressed_size, crc_32 = method(offset, get_compressobj)

    name_encoded = name.encode('utf-8')
@@ -407,7 +547,9 @@ def _chunks():
        _zip_64_local_header_and_data if _method is _ZIP_64 else \
        _zip_32_local_header_and_data if _method is _ZIP_32 else \
        _no_compression_64_local_header_and_data if _method is _NO_COMPRESSION_BUFFERED_64 else \
        _no_compression_32_local_header_and_data
        _no_compression_32_local_header_and_data if _method is _NO_COMPRESSION_BUFFERED_32 else \
        _no_compression_streamed_64_local_header_and_data if _method is _NO_COMPRESSION_STREAMED_64 else \
        _no_compression_streamed_32_local_header_and_data

    central_directory_header_entry, name_encoded, extra = yield from data_func(name_encoded, mod_at_ms_dos, mod_at_unix_extra, external_attr, uncompressed_size, crc_32, _get_compress_obj, evenly_sized(chunks))
    central_directory_size += len(central_directory_header_signature) + len(central_directory_header_entry) + len(name_encoded) + len(extra)
@@ -416,7 +558,7 @@ def _chunks():
    zip_64_central_directory = zip_64_central_directory \
        or (_auto_upgrade_central_directory is _AUTO_UPGRADE_CENTRAL_DIRECTORY and offset > 0xffffffff) \
        or (_auto_upgrade_central_directory is _AUTO_UPGRADE_CENTRAL_DIRECTORY and len(central_directory) > 0xffff) \
        or _method in (_ZIP_64, _NO_COMPRESSION_BUFFERED_64, _NO_COMPRESSION_STREAMED_64)

    max_central_directory_length, max_central_directory_start_offset, max_central_directory_size = \
        (0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff) if zip_64_central_directory else \
@@ -491,6 +633,18 @@ class ZipValueError(ZipError, ValueError):
    pass


class ZipIntegrityError(ZipValueError):
    pass


class CRC32IntegrityError(ZipIntegrityError):
    pass


class UncompressedSizeIntegrityError(ZipIntegrityError):
    pass


class ZipOverflowError(ZipValueError, OverflowError):
    pass

64 changes: 64 additions & 0 deletions test_stream_zip.py
@@ -4,6 +4,7 @@
import os
import stat
import subprocess
import zlib
from tempfile import TemporaryDirectory
from zipfile import ZipFile

@@ -17,6 +18,8 @@
    ZIP_AUTO,
    ZIP_64,
    ZIP_32,
    CRC32IntegrityError,
    UncompressedSizeIntegrityError,
    CompressedSizeOverflowError,
    UncompressedSizeOverflowError,
    OffsetOverflowError,
@@ -100,6 +103,67 @@ def files():
    ]


@pytest.mark.parametrize(
    "method",
    [
        NO_COMPRESSION_32,
        NO_COMPRESSION_64,
    ],
)
def test_with_stream_unzip_with_no_compresion_known_crc_32(method):
    now = datetime.fromisoformat('2021-01-01 21:01:12')
    mode = stat.S_IFREG | 0o600

    def files():
        yield 'file-1', now, mode, method(20000, zlib.crc32(b'a' * 10000 + b'b' * 10000)), (b'a' * 10000, b'b' * 10000)
        yield 'file-2', now, mode, method(2, zlib.crc32(b'c' + b'd')), (b'c', b'd')

    assert [(b'file-1', 20000, b'a' * 10000 + b'b' * 10000), (b'file-2', 2, b'cd')] == [
        (name, size, b''.join(chunks))
        for name, size, chunks in stream_unzip(stream_zip(files()))
    ]


@pytest.mark.parametrize(
    "method",
    [
        NO_COMPRESSION_32,
        NO_COMPRESSION_64,
    ],
)
def test_with_stream_unzip_with_no_compresion_bad_crc_32(method):
    now = datetime.fromisoformat('2021-01-01 21:01:12')
    mode = stat.S_IFREG | 0o600

    def files():
        yield 'file-1', now, mode, method(20000, zlib.crc32(b'a' * 10000 + b'b' * 10000)), (b'a' * 10000, b'b' * 10000)
        yield 'file-1', now, mode, method(1, zlib.crc32(b'')), (b'a',)

    with pytest.raises(CRC32IntegrityError):
        for name, size, chunks in stream_unzip(stream_zip(files())):
            pass


@pytest.mark.parametrize(
    "method",
    [
        NO_COMPRESSION_32,
        NO_COMPRESSION_64,
    ],
)
def test_with_stream_unzip_with_no_compresion_bad_size(method):
    now = datetime.fromisoformat('2021-01-01 21:01:12')
    mode = stat.S_IFREG | 0o600

    def files():
        yield 'file-1', now, mode, method(20000, zlib.crc32(b'a' * 10000 + b'b' * 10000)), (b'a' * 10000, b'b' * 10000)
        yield 'file-1', now, mode, method(1, zlib.crc32(b'')), (b'',)

    with pytest.raises(UncompressedSizeIntegrityError):
        for name, size, chunks in stream_unzip(stream_zip(files())):
            pass


def test_with_stream_unzip_auto_small():
    now = datetime.fromisoformat('2021-01-01 21:01:12')
    mode = stat.S_IFREG | 0o600
