-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Python Pickle format #150
Conversation
serialization/pickle.ksy
Outdated
- id: ops | ||
type: op | ||
repeat: eos | ||
# TODO is there a way to declare a trailing STOP is required? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work in combination with repeat: eos
though, right? (The array will always consume all of the data, so the contents
field will never have any data left and will always fail.)
If STOP
can only appear at the end of the data, you could use repeat: until
to terminate the array after the first STOP
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_io
magic is to the rescue. One can determine the stream size and create a substream containing all but the last bytes. So the last bytes are to be used for contents
and the rest is occupied by a substream containing repeat: eos
serialization/pickle.ksy
Outdated
They are deserialised by importing that constructor, and calling it. | ||
doc-ref: https://github.com/python/cpython/blob/3.3/Lib/pickletools.py | ||
seq: | ||
# TODO is there a way to declare PROTO is optional, but only valid at position 0? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is, but it would require some _io
hacks.
serialization/pickle.ksy
Outdated
encoding: ASCII | ||
terminator: 0x0a # "\n" | ||
doc: Integer, encoded with the ASCII chracters [0-9-], followed by 'L'. | ||
# TODO Can kaitai express constraint that these are quoted? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be difficult without something like kaitai-io/kaitai_struct#81.
These are not generally used, they were not intented to be in the initial commit.
This means efery parsed op will have the same attributes, and that every known opcode is explicitly tested for. So unknown opcodes are unambiguous. Explicit is better than implicit.
Also noticed that a positive exponent would be encoded as 1e+100, not 1e100.
By the way, PEP 574, which proposes version 5 of the Pickle protocol, has been accepted for Python 3.8. This may be interesting in the future, but right now it's way too early to support version 5 - it hasn't even been implemented in CPython yet, and the PEP is missing some important information, like concrete byte values for the new opcodes. |
Huh, thank you. Thought I'd seen that a a while back, and it had been rejected. Must have confused it with another proposal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks! In my perspective, it's ready to merge. @KOLANICH, @dgelessus — do you see any blockers, or I can go ahead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. 👍
There are still a couple of TODO comments in the spec, but I think we can leave them in as a reminder, in case Kaitai Struct supports the necessary features in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't mean to start a review, and I (as the submitter) can't close it
-orig-id: PROTO | ||
doc: identify pickle protocol | ||
0x81: | ||
id: "newobj" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the original names are preserved in -orig-id
how about changing the ids? For example adding words delimiters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand, do you mean something like id: new_object
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this id
, exactly this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I treat the type identifiers similarly? e.g. unicodestring8
-> unicode_string8
ETA For the current type identifiers I used the ArgumentDescriptor.name
attributes in pickletools.py
, e.g. https://github.com/python/cpython/blob/v3.7.3/Lib/pickletools.py#L702
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course you should!
For the current type identifiers I used the
ArgumentDescriptor.name
attributes inpickletools.py
We have -orig-id
for that, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have
-orig-id
for that, right?
That covers the opcode names in the enum. I was asking about the names of the argument types, further up. Those are of the form
unicodestring8:
seq:
id: len
...
from my reading of the KSY spec I could add an -orig-id
to len
, but not to unicodestring8
as a whole. Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any attr beginning from a dash can be added anywhere - they are ignored by KSC and they are for third-party tools.
Looks like vast majority of pressing issues was resolved, so I'll go ahead and merge this in. Thanks for this tremendous effort, everyone, especially @moreati! |
Remaining tasks