Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python Pickle format #150

Merged
merged 20 commits into from
May 28, 2019
Merged

Add Python Pickle format #150

merged 20 commits into from
May 28, 2019

Conversation

moreati
Copy link
Contributor

@moreati moreati commented May 19, 2019

Remaining tasks

  • Add protocol 4 types, opcodes etc

serialization/pickle.ksy Outdated Show resolved Hide resolved
- id: ops
type: op
repeat: eos
# TODO is there a way to declare a trailing STOP is required?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work in combination with repeat: eos though, right? (The array will always consume all of the data, so the contents field will never have any data left and will always fail.)

If STOP can only appear at the end of the data, you could use repeat: until to terminate the array after the first STOP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_io magic is to the rescue. One can determine the stream size and create a substream containing all but the last bytes. So the last bytes are to be used for contents and the rest is occupied by a substream containing repeat: eos

They are deserialised by importing that constructor, and calling it.
doc-ref: https://github.com/python/cpython/blob/3.3/Lib/pickletools.py
seq:
# TODO is there a way to declare PROTO is optional, but only valid at position 0?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is, but it would require some _io hacks.

encoding: ASCII
terminator: 0x0a # "\n"
doc: Integer, encoded with the ASCII chracters [0-9-], followed by 'L'.
# TODO Can kaitai express constraint that these are quoted?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be difficult without something like kaitai-io/kaitai_struct#81.

serialization/pickle.ksy Outdated Show resolved Hide resolved
serialization/pickle.ksy Outdated Show resolved Hide resolved
serialization/pickle.ksy Outdated Show resolved Hide resolved
serialization/pickle.ksy Outdated Show resolved Hide resolved
serialization/pickle.ksy Outdated Show resolved Hide resolved
serialization/pickle.ksy Outdated Show resolved Hide resolved
moreati added 7 commits May 20, 2019 21:26
These are not generally used, they were not intented to be in the
initial commit.
This means efery parsed op will have the same attributes, and that every
known opcode is explicitly tested for. So unknown opcodes are
unambiguous.

Explicit is better than implicit.
Also noticed that a positive exponent would be encoded as 1e+100, not
1e100.
serialization/python_pickle.ksy Outdated Show resolved Hide resolved
serialization/python_pickle.ksy Show resolved Hide resolved
serialization/python_pickle.ksy Outdated Show resolved Hide resolved
@dgelessus
Copy link
Contributor

By the way, PEP 574, which proposes version 5 of the Pickle protocol, has been accepted for Python 3.8. This may be interesting in the future, but right now it's way too early to support version 5 - it hasn't even been implemented in CPython yet, and the PEP is missing some important information, like concrete byte values for the new opcodes.

@moreati
Copy link
Contributor Author

moreati commented May 24, 2019

PEP 574, which proposes version 5 of the Pickle protocol, has been accepted for Python 3.8

Huh, thank you. Thought I'd seen that a a while back, and it had been rejected. Must have confused it with another proposal.

@GreyCat GreyCat self-requested a review May 24, 2019 14:11
Copy link
Member

@GreyCat GreyCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks! In my perspective, it's ready to merge. @KOLANICH, @dgelessus — do you see any blockers, or I can go ahead?

Copy link
Contributor

@dgelessus dgelessus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. 👍

There are still a couple of TODO comments in the spec, but I think we can leave them in as a reminder, in case Kaitai Struct supports the necessary features in the future.

Copy link
Contributor Author

@moreati moreati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't mean to start a review, and I (as the submitter) can't close it

serialization/python_pickle.ksy Outdated Show resolved Hide resolved
-orig-id: PROTO
doc: identify pickle protocol
0x81:
id: "newobj"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the original names are preserved in -orig-id how about changing the ids? For example adding words delimiters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand, do you mean something like id: new_object?

Copy link
Contributor

@KOLANICH KOLANICH May 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this id, exactly this.

Copy link
Contributor Author

@moreati moreati May 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I treat the type identifiers similarly? e.g. unicodestring8 -> unicode_string8

ETA For the current type identifiers I used the ArgumentDescriptor.name attributes in pickletools.py, e.g. https://github.com/python/cpython/blob/v3.7.3/Lib/pickletools.py#L702

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course you should!

For the current type identifiers I used the ArgumentDescriptor.name attributes in pickletools.py

We have -orig-id for that, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have -orig-id for that, right?

That covers the opcode names in the enum. I was asking about the names of the argument types, further up. Those are of the form

unicodestring8:
  seq:
    id: len
    ...

from my reading of the KSY spec I could add an -orig-id to len, but not to unicodestring8 as a whole. Did I miss something?

Copy link
Contributor

@KOLANICH KOLANICH May 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any attr beginning from a dash can be added anywhere - they are ignored by KSC and they are for third-party tools.

@GreyCat
Copy link
Member

GreyCat commented May 28, 2019

Looks like vast majority of pressing issues was resolved, so I'll go ahead and merge this in. Thanks for this tremendous effort, everyone, especially @moreati!

@GreyCat GreyCat merged commit e28b3f1 into kaitai-io:master May 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants