Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add strict_timestamps option #117

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

frederikaalund
Copy link

I ran into an issue today with a device of ours that had a file from 1970 (or at least that was the timestamp). In turn, stream-zip gave a cryptic error. Turns out, the ZIP file format simply does not support timestamps from before 1980. This is a known issue and python's built-in zipfile module provides the strict_timestamps option as a workaround for this.

In short, this PR:

  • Adds strict_timestamps as an optional argument to stream_zip. It defaults to true (replicating the current behaviour; no breaking changes).
  • When you set strict_timestamps=False then the library coerces any modified_at timestamps into [1980; 2108). This silently avoids underflow/overflow.

The CPython zipfile implementation inspired my implementation in this PR.

Let me know what you think!

Side note: Thank you for this great library (along with stream-unzip). 👍 It's crucial for the backup/restore feature on our devices. :)

@michalc
Copy link
Member

michalc commented Apr 12, 2024

Ah thanks for the PR!

So probably yes, should do something about this... But...

  1. I'm leaning to not having an option at all and just "make it work" so earlier/later timestamps should in all cases be clamped so the zip file is created (Unless we can think of a use case where an error is desirable? Maybe some cases do need some validation for not-too-early-or-too-late timestamps, but I suspect that shouldn't be part of stream-zip)
  2. And I'm leaning to not clamping the extended/unix timestamp if there is one, and only ever clamp the MS DOS one. I'm guessing a bit, but I think it would mean that clients that support the extended timestamps would then still in all cases use the right timestamp.
  3. And leaning to having a bit of documentation on this behaviour somewhere, i.e. for older clients or all clients when extended_timestamps=False, the dates are clamped due to a limitation of the ZIP format

And also - I think would be good to add or change an existing tests around this, checking cases above and below the clamping range for both with and without the extended timestamps somehow.

@michalc
Copy link
Member

michalc commented Apr 20, 2024

@frederikaalund Let us know if you want to do the above / discuss more? No is all good of course, but in that case I would be tempted for us to make the changes at our end. I'm seeing this PR as essentially a bug report and I'm keen to get it sorted.

@michalc
Copy link
Member

michalc commented Apr 20, 2024

To just think out loud, I wonder also if the extended/unix timestamps need to be clamped for times outside of the range representable by a signed 32 bit integer, e.g. after 2038...

@frederikaalund frederikaalund changed the title feat: add struct_timestamps option feat: add strict_timestamps option Apr 21, 2024
@frederikaalund
Copy link
Author

Thanks for the feedback and I apologize for me late reply. It's been a busy week :)

I lean towards (1) too and only decided to not implement it that way since it's technically a breaking change. If you're okay with this subtle change then I'll gladly implement it without the kwarg option. 👍 It is an edge case after all.

If we go down this route, I too think that it's a good idea to clamp the timestamps in general (regardless of whether we store them in the MSDOS header or the Extended timestamp header). Something like this:

  • MSDOS timestamp header: Silently clamp to [1980; 2108)
  • Extended timestamp header: Silently clamp to [1902; 2038)

As for tests and docs, of course, let me add that as well as part of this PR. :) I would ask for the same thing myself so I just shamelessly tried to get away with not doing it 😅

Let me know what you think. Thanks again for your feedback on this PR!

@michalc
Copy link
Member

michalc commented Apr 21, 2024

If you're okay with this subtle change then I'll gladly implement it without the kwarg option. 👍 It is an edge case after all.

I think I am - it's the better behaviour I would say for my use cases, and it sounds like yours too, and it's simpler to not have an option. Avoiding a breaking change I don't think is worth it really.

Something like this:
MSDOS timestamp header: Silently clamp to [1980; 2108)
Extended timestamp header: Silently clamp to [1902; 2038)

Yes exactly (and I might lean to the making the range are large as possible i.e. clamping to exactly the largest/smallest second that's representable for each)

@frederikaalund frederikaalund force-pushed the feat-strict-timestamps branch from f8a9d51 to 15139e3 Compare May 4, 2024 10:30
@frederikaalund
Copy link
Author

frederikaalund commented May 4, 2024

Hi again, finally found some time to work on this. :)

I just pushed a rework that silently clamps the timestamps to the available precision in the MS-DOS and UNIX headers.

The timestamp limits are:

  • MSDOS timestamp header: Silently clamp to [1980; 2108)
  • Extended timestamp header: Silently clamp to [1970; 2038)

Note that I now use a lower limit of 1970 (and not 1902 as previously discussed). Some tools (unzip in particular) has problems with negative timestamps. It's technically a bug in unzip but it's a rather popular tool so why not just abide.

I also added tests and docs. Let me know if you want anything changed 👍

@michalc
Copy link
Member

michalc commented May 6, 2024

Looks great! Having a bit of a nose around and think still - will leave comments in the code

If `extended_timestamps=True` (the default):

* Timestamps are clamped between 1970-1-1 and 2038-1-19 (both inclusive)
* Timestamps are rounded down with 1-second precision
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the situation a bit more complex than this? If extended_timestamps=True, then it depends on the client - ones that read the extended timestamps will use the greater range, but older ones will still use the smaller range.

I wonder if it's better if the documentation should be split into "older clients" and "newer clients" (or something like that). The older clients in all cases use the smaller range, but the newer ones depend on extended_timestamps?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data in the file (er, in the stream) doesn't depend on the client, and I think that's all you can reasonably make statements about. What the client will do with it depends on the client, but that doesn't change anything about the stream or the values encoded in it. And documenting what will be done with that data by all possible clients seems both out of scope and impossible.

I guess you could say something like this, to be more precise:

If `extended_timestamps=True` (the default), a second
timestamp field is added which will be read by
supporting clients. Extended timestamps:

* Use UNIX-style "seconds since epoch" values
* Are clamped between 1970-1-1 and 2038-1-19
  (both inclusive)
* Are rounded down with 1-second precision

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Perhaps first documenting the standard timestamps without the "if extended_timestamps=False" qualifier, since those timestamps will always be there.)

],
)
def test_unzip_modification_time(method, timezone, modified_at):
def test_unzip_modification_time(method, timezone, modified_at, expected_modified_at):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just running the tests locally, and I'm getting some errors... I wonder if it's because I'm currently in British Summer Time, 1 our off from UTC...

pytest 'test_stream_zip.py::test_unzip_modification_time[UTC+0-modified_at0-expected_modified_at0-ZIP_32]'

Results in

>           assert os.path.getmtime('my_file') == int(expected_modified_at.timestamp())
E           AssertionError: assert 0.0 == -3600
E            +  where 0.0 = <function getmtime at 0x101338040>('my_file')
E            +    where <function getmtime at 0x101338040> = <module 'posixpath' (frozen)>.getmtime
E            +      where <module 'posixpath' (frozen)> = os.path
E            +  and   -3600 = int(-3600.0)
E            +    where -3600.0 = <built-in method timestamp of datetime.datetime object at 0x1040fb990>()
E            +      where <built-in method timestamp of datetime.datetime object at 0x1040fb990> = datetime.datetime(1970, 1, 1, 0, 0).timestamp

The tests test_unzip_modification_time do pass for me.

Not quite sure why right now...

Copy link

@ferdnyc ferdnyc May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this jszip issue, convention is that MS-DOS timestamps are in local time, whereas extended timestamps use UNIX conventions which means all timestamps are values in UTC. That's backed up by a note in the unzip(1) man page, as well:

-f freshen existing files, i.e., extract only those files that al‐ ready exist on disk and that are newer than the disk copies. By default unzip queries before overwriting, but the -o option may be used to suppress the queries. Note that under many operating systems, the TZ (timezone) environment variable must be set cor‐ rectly in order for -f and -u to work properly (under Unix the variable is usually set automatically). The reasons for this are somewhat subtle but have to do with the differences between DOS-format file times (always local time) and Unix-format times (always in GMT/UTC) and the necessity to compare the two. A typical TZ value is ``PST8PDT'' (US Pacific time with automatic adjustment for Daylight Savings Time or ``summer time'').

Both os.path.getmtime() and datetime.timestamp() seem to respect that.

# These produce the same value
datetime.datetime.now().timestamp()
datetime.datetime.now(datetime.UTC).timestamp()

# as do these
datetime.datetime.fromtimestamp(os.path.getmtime('some-file')).timestamp()
datetime.datetime.fromtimestamp(os.path.getmtime('some-file'), datetime.UTC).timestamp()

But a datetime.datetime(1970, 1, 1, 0, 0) with no timezone information will be interpreted as a local time, so the timestamp will have a timezone offset. IOW, here in NYC at UTC-5:

>>> dt = datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=None)
>>> dt.timestamp()
18000.0
>>> dtutc = datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.UTC)
>>> dtutc.timestamp()
0.0

# In principle, we the lower limit should be `-2**31` but we set it
# to zero to avoid issues with common zip utilities like `unzip`.
# Said tools do not correctly interpret negative timestamps.
max(min(int(modified_at.timestamp()), 2**31 - 1), 0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From (admittedly only brief tests) it looks like once times go below 1970 with extended timestamps, then unzip seems to clamp to 1980 (so probably using the MS DOS timestamp?). While odd certainly, I guess I'm leaning to going for the full range in the zip file, because the timestamp before 1970 will be wrong in unzip no matter what, but we can at least make timestamps before 1970 right for other clients.

Comment on lines 1092 to +1094
subprocess.run(['unzip', f'{d}/test.zip', '-d', d], env={'TZ': timezone})

assert os.path.getmtime('my_file') == int(modified_at.timestamp())
assert os.path.getmtime('my_file') == int(expected_modified_at.timestamp())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failures seem to be caused by the unzip being run with the given timezone variable, but expected_modified_at.timestamp() being interpreted as local time when it has no tzinfo set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants