Serialization #27
We've been slowly discussing this issue, but it's obviously not as easy as it looks. It will probably be a major feature for something like the second major version (v2.0 or so), so it'll eventually be there, but don't hold your breath for it.
What are the troubles?
It is pretty easy for simple fixed (C-style) structures. However, things get complicated as soon as you start using something like this:

```yaml
seq:
  - id: num_files
    type: u4
  - id: files
    type: file
    repeat: expr
    repeat-expr: num_files
types:
  file:
    seq:
      - id: file_ofs
        type: u4
      - id: file_size
        type: u4
    instances:
      body:
        pos: file_ofs
        size: file_size
```

This .ksy describes a very simple file index structure which consists of `num_files` pairs of (file_ofs, file_size). Each pair describes a "file", which can be accessed by index using something like:
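For illustration, a minimal sketch of what such read access could look like in Java — the class and accessor names here are assumed for the sake of the example, not actual compiler output:

```java
// Hypothetical generated API; FileIndex and its accessors are assumed names.
FileIndex index = FileIndex.fromFile("archive.bin");
FileIndex.File entry = index.files().get(0); // pick an entry by its index
byte[] contents = entry.body(); // `body` instance: file_size bytes at file_ofs
```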
However, adding a file to such an archive is a challenge. Ideally, one would want to do something like:
This should set the relevant bound values automatically. What's even harder is that in many cases (i.e. archive files, file systems, etc.) you don't want to rewrite the whole file, but just make some changes (i.e. appending a new file to the end of the archive, or reusing some pre-reserved padding, or something like that). Another, maybe even simpler example: if you read PNGs, you don't care about checksums. When you write PNGs, you have to generate proper checksums for every block — thus we need block checksumming algorithms and some way to bind them to blocks.
What's the problem with deserializing them, editing, serializing back and then writing?
What exactly do you refer to as "them"? The file archive example? PNG checksums?
Almost anything, including files and checksums. Now some ideas on how to implement this: when processing a KS file, it should build a graph of object dependencies. If you want to minimize the number of write operations, you store both versions in memory, create a diff, and try not to touch the parts not touched by the diff.
Exactly my thoughts. And actually even that requires us to create some sort of inverse derivation engine. For example, if we have a binding:

```yaml
- id: body_size
  type: u4
- id: body
  size: body_size * 4 + 2
```

and we update `body`, then `body_size` needs to be re-derived by inverting that size expression.
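To make the inversion concrete, here is a minimal hand-written Java sketch of what an inverse derivation engine would have to produce for this particular binding (the helper name is hypothetical):

```java
// Inverse of `size: body_size * 4 + 2`: given the new body, solve
// len(body) = body_size * 4 + 2 for body_size.
static long deriveBodySize(byte[] body) {
    int len = body.length;
    if (len < 2 || (len - 2) % 4 != 0)
        throw new IllegalArgumentException(
            "body length " + len + " cannot satisfy len = body_size * 4 + 2");
    return (len - 2) / 4;
}
```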
I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?
The point is that we need to add some extra syntax (or API, or something) to make it possible to do custom space allocation or suchlike. For example, if you're editing a file system, you can't just always append stuff to the end of it: a block device usually has finite capacity, and sooner or later you'll exhaust it, and your only choice will be to pick and reuse some free blocks in the middle of the block device. You seem to have good ideas on implementation — would you like to join and help implement it? Basically anything will help: .ksy syntax ideas, API ideas, tests for serialization, compiler ideas & code, etc.
Or take some library. Symbolic evaluation is a rather well-studied area of math with lots of papers written about it. I haven't studied it, so it is possible that some of the ideas I've mentioned were already discussed in them.
Suppose we have an array and a size_t scalar — the number of elements in the array. Which one is primary in this pair? What defines it entirely? The data in the array does; the number is needed only to allow us to read the array correctly. For example, with C strings you don't need the number, because the convention is to terminate with \0. Add more data to the array and you'll have to increase the array capacity, which means you will have to increase the number in order to read it (and everything after it) correctly. So let array.length have strength 2, and the scalar strength 1. Then you have the link "files[].length <-> num_files". Suppose we have another array "filenames[]" with its own capacity. Now we start processing (let's show strengths in brackets, absolute first); then we need to serialize and read the config.
It determines the order in which we should evaluate expressions and which expressions depend on which, and it helps to find conflicts. And I think you should really read something about symbolic evaluation (I haven't; the ideas above are just ad-hoc thoughts, and maybe there are better approaches to it).
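As a rough illustration of that dependency-graph idea (an ad-hoc sketch, not KSC code): represent "writing A requires B to be finalized first" as edges, topologically sort, and report cycles as conflicts:

```java
import java.util.*;

class DepGraph {
    // Returns an order in which fields can be evaluated; throws on cyclic links.
    static List<String> evaluationOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>(), inProgress = new HashSet<>();
        for (String node : dependsOn.keySet())
            visit(node, dependsOn, done, inProgress, order);
        return order;
    }

    static void visit(String n, Map<String, List<String>> g, Set<String> done,
                      Set<String> inProgress, List<String> order) {
        if (done.contains(n)) return;
        if (!inProgress.add(n))
            throw new IllegalStateException("conflict: cyclic dependency at " + n);
        for (String dep : g.getOrDefault(n, Collections.<String>emptyList()))
            visit(dep, g, done, inProgress, order);
        inProgress.remove(n);
        done.add(n);
        order.add(n);
    }

    public static void main(String[] args) {
        // num_files depends on files (the array has the higher "strength").
        Map<String, List<String>> g = new HashMap<>();
        g.put("num_files", Arrays.asList("files"));
        g.put("files", Collections.<String>emptyList());
        System.out.println(evaluationOrder(g)); // [files, num_files]
    }
}
```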
Sorry, no. Maybe I'll send you some ideas or code later, but I can't be a permanent member of this project.
OK.
OK, then what shall we do in case of the following:

```yaml
seq:
  - id: num_objs
    type: u4
  - id: headers
    type: header
    repeat: expr
    repeat-expr: num_objs
  - id: footers
    type: footer
    repeat: expr
    repeat-expr: num_objs
```

This implies that `num_objs` is bound to the lengths of two distinct arrays at once.
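A minimal sketch of how this could be checked, in the spirit of the `_check()` idea proposed later in this thread (the method and exception names follow that later proposal, not real generated output):

```java
// Sketch: num_objs is bound to the lengths of two arrays at once, so a
// consistency check must require both arrays to agree before writing.
public void _check() {
    if (headers().size() != numObjs())
        throw new FormatConsistencyError("headers count", headers().size(), numObjs());
    if (footers().size() != numObjs())
        throw new FormatConsistencyError("footers count", footers().size(), numObjs());
}
```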
1. See the above example. P.S. Graph evaluation is done by the KS compiler, and I fixed the priorities to match the lines I made, supposing the strengths were 1 and 2.
I've committed very basic PoC code that demonstrates the approach. Obviously, only a few tests were converted, and, to be frank, right now only a very basic set of types is supported. Even strings are not implemented at the moment. Testing is very basic too. I'd really love to hear any opinions on the API (both runtime & generated), the Java implementation (it's a real pain, as ByteBuffer does not grow, so you have to preallocate the array, or probably reimplement everything twice with something that grows), etc.
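To illustrate that pain point with plain JDK types (a sketch of the trade-off, not code from the branch): `java.nio.ByteBuffer` has a fixed capacity, while the obvious growable alternative lacks the random access that positional instances need when writing:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class GrowDemo {
    public static void main(String[] args) {
        // Fixed capacity: the caller must preallocate; overflowing throws.
        ByteBuffer fixed = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        fixed.putInt(42); // BufferOverflowException once the 16 bytes run out

        // Growable, but sequential and write-only: no seek(), which
        // positional writes (pos: ...) would require.
        ByteArrayOutputStream growable = new ByteArrayOutputStream();
        growable.write(new byte[] { 42, 0, 0, 0 }, 0, 4);
    }
}
```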
Strings work, repetitions work, enums work, things are slowly coming to reality :)
Serialization progresses slowly. Basic in-stream user types and processing on byte types work; various fancy byte type features are still missing. I've got an idea that might be very simple to implement.

**Step 1: manual checks**

Generate a `_check()` method that verifies consistency. For example:

```yaml
seq:
  - id: len_of_1
    type: u2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1
```

would generate:

```java
public void _read() {
    this.lenOf1 = this._io.readU2le();
    this.str1 = new String(this._io.readBytes(lenOf1() * 2 + 1), Charset.forName("ASCII"));
}
public void _write() {
    this._io.writeU2le(this.lenOf1);
    this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
}
public void _check() {
    if (this.str1.getBytes(Charset.forName("ASCII")).length != lenOf1() * 2 + 1)
        throw new FormatConsistencyError("str1 size", this.str1.getBytes(Charset.forName("ASCII")).length, lenOf1() * 2 + 1);
}
```

To use this properly, one must manually set both `str1` and `len_of_1`, then run the checks before writing:

```java
r.setStr1("abcde");
r.setLenOf1(2);
r._check(); // should pass, so we're clean to take off
r._write(); // should write consistent data that's guaranteed to be readable back
```

**Step 2: dependent variables**

We declare some fields as "dependent", and mark them up in ksy:

```yaml
seq:
  - id: len_of_1
    type: u2
    dependent: (str1.to_b("ASCII").size - 1) / 2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1
```

This means that `len_of_1` gets derived automatically on writing:

```java
public void _write() {
    this.lenOf1 = (str1().getBytes(Charset.forName("ASCII")).length - 1) / 2;
    this._io.writeU2le(this.lenOf1);
    this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
}
```

Obviously, using this boils down to a single `_write()` call. Any comments / ideas / etc?
You mean, something like that?

```java
public void _read() {
    this.lenOf1 = this._io.readU2le();
    this.str1 = new String(this._io.readBytes(_sizeStr1()), Charset.forName("ASCII"));
}
public void _write() {
    this._io.writeU2le(this.lenOf1);
    this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
}
public void _check() {
    if (this.str1.getBytes(Charset.forName("ASCII")).length != _sizeStr1())
        throw new FormatConsistencyError("str1 size", this.str1.getBytes(Charset.forName("ASCII")).length, _sizeStr1());
}
public int _sizeStr1() {
    return lenOf1() * 2 + 1;
}
```

Does it bring any benefits? It won't really simplify generation (probably on the contrary), and we'll need to invent tons of names for all those helper methods.
Naming is, of course, still to be discussed. One argument against it: …
I guess we could try to do some sort of automatic derivation of these dependent expressions, so they wouldn't have to be written out by hand.
First of all, you can't really eliminate them completely in any case. Some functions are just irreversible, and in some cases you'll have more free variables than constraints. For example:

```yaml
seq:
  - id: a
    type: u1
  - id: b
    type: u1
  - id: my_str
    type: str
    size: a + b
```

Even if you have the byte size of `my_str`, there's no unambiguous way to derive both `a` and `b` from it: e.g. a 7-byte string is described equally well by a=0, b=7 and by a=3, b=4.
Yes, that's why I've chosen it. It has the same semantics: deriving a value of a struct member from other ones makes it some kind of an instance.
Yes, we have already discussed this.
Since they are of equal strength. This should cause ksc to throw a warning about undefined behavior.
Not only that. One expression can contradict another one; it can cause nasty errors or can be used as a backdoor. The other side is that we (humans) don't know the exact expressions ahead of time. So I propose the following: …
Maybe.
I would like to play with the serialization branch... Is there a pre-built compiler and Java runtime of the serialization branch available? If not, that's OK, I'll set up a Scala build environment.
@ixe013 There are no pre-built packages for such experimental branches, but it's relatively straightforward to try it yourself. You don't really need anything beyond sbt — it will download all the required stuff (including the Scala compiler and libraries of the needed versions) automatically. See the dev docs — it's literally one command to run.
I was able to build a read-write version of my ksy; however, the generated code does not compile, as the reader reads Ints and stores them in enums as Longs (Int->Long), and the _write()s try to write the enum Longs as Ints (Long->Int causes an error). I have these errors: …
I also have errors in every constructor, where it tries to assign the _parent received as a KaitaiStruct in the constructor signature to the _parent field without casting to the more specific type. Type mismatch: cannot convert from KaitaiStruct to KaitaiStruct.ReadWrite
@glenn-teillet There's still lots of work to do; it's nowhere near production-ready. Before trying to compile real-life ksy files with lots of complicated details, we should probably spend some time getting the read-write test infrastructure to work and, probably, porting that older serialization branch code to the modern codebase.
Some suggested definitions for the serialization spec:
Serialization is the part of KSC producing serialization programs.
Comments?
This is a very interesting subject. What is the current status of this? Will this be pursued in the future? It would be very good if one could use serialization. Especially for bit-based binary formats this would be a great relief. I hope that the issue will be pursued further and that we can also benefit from it in the future.
No updates here; the latest comments like #27 (comment) still reflect the current state of affairs. Help is welcome :)
Typo fix: "slighly" --> "slightly"
Are there any plans to merge the initial serialization support anytime soon? It's tough to base future work on such an old parallel branch.
It’d be really nice to see any motion on this. I think even having pretty broken serialization would be good for a start…
I seriously can't understand why this bug is still open after more than 5 years. If you have any support at all, you get small PRs that improve the behaviour because people need certain features. Expecting perfect support from the beginning is unrealistic, but there's still no support at all.
The problem is simple: while there has been some work done, I don’t think there is anyone actively working on this. It’s worth understanding that a huge use case for Kaitai is in dissecting data and reverse engineering, where you indeed would not care about serialization. It’d be nice to get things moving, but even if some partial implementation would be acceptable, nobody is working on pushing one forward at the moment. So I guess that would be the first problem to address, if any. Open source isn’t really a competition, though. If you happen to know of a good alternative that fills this niche, it would be nice to know. On my end, I have a Go library that sort-of overlaps, but aside from being limited to Go, it, too, could use some more maintenance…
Kaitai's compiler does have a target for Python's Construct library, which does support both serialization and deserialization. I have several open PRs for that compiler target to bring it up to parity with other targets. There are some KSY mechanisms it doesn't quite support yet, but for basic use cases it does work. Of course, that does also require using Python.
I've been taking notes on various things about the implementation of serialization - in the event that someone wants to read them, here they are: https://gist.github.com/generalmimon/fc22e97faf1fe4b4edc8279b0caa152d
Are you sure we want things like mixing read/write? What I'd target, at least as a first version, is to just have "write to file" functionality. It might not be perfectly efficient, but it's a lot simpler. Also — not sure if I missed that in your document — what about structures that are pointed to from different locations? That's why I originally opted for two-pass writing, so you get a chance to update the data with real offsets from the dry run before it's actually written.
I don't know what you're referring to. Can you be more specific and perhaps quote some parts of the document to ensure that we're on the same page?
What do you think I am going for after reading the document? "Fill the classes and then give the main class the instruction to write itself to file/stream" sounds exactly how I intend the resulting serialization-capable classes to be used. I'm just discussing what happens when you give the main class the instruction and exactly how it's done under the hood.
Once again, I have no clue what you mean. This can mean so many things. "Pointed to" is like when you have a positional instance (`pos`)?
Guess I missed the later part
Ignore that, please
Seeing the generated code, it looked like data is written as soon as the functions are called.
Take the following dummy file:
And now assume pos_chunk_2 has to be the absolute address (from the beginning of the file) where the second chunk starts. For reading, no problem; but for writing, we need to set the value correctly. In this case it's simple, but it can get a lot more complicated. I currently don't see how we can do that using a single writing pass.
I guess:
Another issue: we sometimes use …
Either that, or give the user the ability to update variables with position info before they're written: first calculate all target positions, then let the user update the structure again. Seems easier than implementing a hint system, anyway.
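A minimal sketch of that two-pass flow, reusing the `ByteBufferKaitaiStream` pattern that appears later in this thread — `MAX_SIZE`, `setPosChunk2` and `chunk2Start()` are invented names for illustration:

```java
// Pass 1: dry run into a scratch buffer with placeholder offsets, only to
// learn where each chunk actually lands.
byte[] scratch = new byte[MAX_SIZE];
try (KaitaiStream io = new ByteBufferKaitaiStream(scratch)) {
    r._write(io);
}
// Let the user patch position fields with the real offsets from the dry run.
r.setPosChunk2(chunk2Start(r));
r._check();

// Pass 2: the real write, now with correct absolute addresses.
byte[] out = new byte[MAX_SIZE];
try (KaitaiStream io = new ByteBufferKaitaiStream(out)) {
    r._write(io);
}
```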
Sorry, this was really just an incomplete snippet in a rant about whether or not to keep that. If you want to look at a bit more complex use case of what the serialization API looks like, see the tests.
My design has a simple answer to all sizes and offsets — calculated and set by the user, checked by KS-generated code (well, if EOF errors count as checking, but I guess they do; larger sizes than needed are always valid). Also, all streams have a fixed size (including the root stream), i.e. there is no "growable" stream. This is very likely to change in the future to make the serialization process more user-friendly, but I would insist on keeping it for the early version of serialization, because it makes many things simpler (from the perspective of the KS compiler, not the user, of course). And yet there are quite a few problematic things that you don't even know about, but I have already dealt with many of them. You'd think that having to set the sizes of everything as a user is unmanageable, but it's actually easy — you do it gradually, just for the direct fields of the type you're currently processing, and delegate lengths that depend on child types to their respective methods that fill each one of them:

```csharp
private void Build()
{
    var cdr = new CoreldrawCdr(new List<KaitaiStream>(), null);
    cdr.RiffChunk = new CoreldrawCdr.RiffChunkType(null, cdr, cdr.M_Root);
    var lenFile = Fill(cdr.RiffChunk);
    // ...
}

private uint Fill(CoreldrawCdr.RiffChunkType chunk)
{
    chunk.ChunkId = new[] { (byte)'R', (byte)'I', (byte)'F', (byte)'F' };
    chunk.Body = new CoreldrawCdr.CdrChunkData(null, chunk, chunk.M_Root);
    chunk.LenBody = Fill(chunk.Body);
    chunk.PadByte = GetPadByte(chunk.LenBody);
    return
        (uint)chunk.ChunkId.Length + // chunk_id
        4 +                          // len_body
        chunk.LenBody +              // body
        (uint)chunk.PadByte.Length   // pad_byte
    ;
}

private uint Fill(CoreldrawCdr.CdrChunkData chunkData)
{
    // ...
```

You can also use this to calculate offsets, reserve space, etc. So I don't have any concerns about it being particularly limiting. Lots of people in this thread seemed like they would be happy with any simple, stupid solution, and I'm here to provide it.
I've already thought of that and resolved it: you can individually disable which positional instances get written, via a generated setter:

```java
public void setHeader_ToWrite(boolean _v) { _toWriteHeader = _v; }
```

I didn't know a better name for it, but it means "whether to write the instance `header`". I encourage you to explore the generated code. I'm currently working on substreams.
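Usage might then look like this (a sketch; `Foo` and the instance name `header` are placeholders):

```java
Foo r = new Foo();
// ... populate seq fields and the `header` instance as usual ...
r.setHeader_ToWrite(false); // skip serializing the positional instance `header`
r._check();
r._write(io);
```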
Yeah, me too! Thanks for your work and the explanation. IMHO it will get a lot easier when you have a solid foundation!
Serialization for Java is done. I started a documentation page for how to use it: https://doc.kaitai.io/serialization.html |
Let me break out a separate discussion with a checklist of what's needed to merge serialization into master: #1060
Thanks for all the hard work! I'm super interested in getting some implementation for C++ — is there an existing effort toward that goal, or anything I could help with?
I'm a little frustrated using the serialization for Java. Everything looked great at first, but when I added a CustomProcessor for encryption and decryption, I got many ConsistencyErrors. The generated code created a lot of `_raw_*` fields. What is the correct way to set them? For now, I'm implementing a workaround.
Another problem: the generated Java code has the below behavior:

```java
public void _write_Seq() {
    this._io.writeS4be(lenPayload());
    Compressor _process_payload = new Compressor();
    this._raw_payload = _process_payload.encode(payload());
    // It's basically impossible to predict the length of the data and set lenPayload correctly before performing compression.
    if ((_raw_payload().length != lenPayload()))
        throw new ConsistencyError("payload", _raw_payload().length, lenPayload());
    this._io.writeBytes(_raw_payload());
}
```

@generalmimon Could you please give me some advice?
@hu-chia First of all, thanks for your feedback!

Your observations are correct. As I've already hinted in some previous comments, the serialization feature deals only with serialization itself so far — it expects a fully populated object tree, including lengths, offsets, CRC-32 checksums and whatnot, which must be given by the user in advance. No auto-deriving of values of auxiliary fields takes place yet. Also, as you correctly note in your second comment, the processing routine runs as soon as you call `_write()`. To your specific questions:

The `_raw_*` fields are set by the generated `_write` code itself (as your snippet shows), so you shouldn't need to set them manually.
Yes, this is a bit tricky. Basically you have to make a test write of the contents that will be compressed, compress it and see how many bytes you get. Let me demonstrate the suggested solution with an example:

```yaml
meta:
  id: zlib_serialization
  endian: le
seq:
  - id: len_body
    type: u4
  - id: body
    size: len_body
    type: contents
    process: zlib
types:
  contents:
    seq:
      - id: foo
        type: s4
      - id: bar
        type: f4
```
type: f4 One feasible way to serialize this is as follows (using the same pattern as in #27 (comment)): private static byte[] build() throws IOException {
ZlibSerialization r = new ZlibSerialization();
r.body = new ZlibSerialization.Contents(null, r, r._root());
r.setBody_InnerSize(fill(r.body));
// test write
byte[] bodyBuf = new byte[r.body_InnerSize()];
try (KaitaiStream io = new ByteBufferKaitaiStream(bodyBuf)) {
r.body._write(io);
}
r.lenBody = KaitaiStream.unprocessZlib(bodyBuf).length;
r._check();
int lenFile =
4 + // len_body
(int)r.lenBody // body
;
byte[] output = new byte[lenFile];
try (KaitaiStream io = new ByteBufferKaitaiStream(output)) {
r._write(io);
}
return output;
}
private static int fill(ZlibSerialization.Contents cont) {
cont.setFoo(-4);
cont.setBar(1.5f);
cont._check();
return
4 + // foo
4 // bar
;
}
private static String byteArrayToHex(byte[] arr) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < arr.length; i++) {
if (i > 0)
sb.append(' ');
sb.append(String.format("%02x", arr[i]));
}
return sb.toString();
}
public static void main(String[] args) throws IOException {
System.out.println(byteArrayToHex(build()));
// output: 10 00 00 00 78 9c fb f3 ff ff 7f 06 86 03 f6 00 1b 95 04 f9
} For CRC-32 checksums, for example, we can use the same approach, just instead of checking how many compressed bytes we get, we calculate the CRC-32 checksum and save it in the object tree so that it's available once it comes to the actual serialization. Of course, this requires the processing routine to be deterministic. Footnotes
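For instance, a sketch of the CRC-32 variant of the same trick — assuming a hypothetical ksy where a `crc32` field covers the `body` bytes, with an invented setter name:

```java
import java.util.zip.CRC32;

// After the test write of `body` into bodyBuf (as above), compute the
// checksum and store it in the object tree before the real _write() pass.
CRC32 crc = new CRC32();
crc.update(bodyBuf, 0, bodyBuf.length);
r.setCrc32(crc.getValue()); // hypothetical setter for a u4 `crc32` field
r._check();
```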
I guess there would be two compressions here (one in my test write, one in the `process` in the generated code), am I right?

```java
r.lenBody = KaitaiStream.unprocessZlib(bodyBuf).length;
r._check();
```

My solution now is to pass in the compressed/encrypted data when constructing the object tree, and provide an empty implementation for the encode method of the custom processor.
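A sketch of that workaround — the `CustomProcessor` interface name is inferred from the `encode`/`decode` calls visible in the generated code above, and `MyCipher` is hypothetical:

```java
// Pass-through processor: the user stores already-encrypted bytes in the
// object tree, so encode() becomes a no-op; decode() still works for reading.
public class PassThroughProcessor implements CustomProcessor {
    @Override
    public byte[] decode(byte[] src) { return MyCipher.decrypt(src); } // hypothetical cipher
    @Override
    public byte[] encode(byte[] src) { return src; } // data already processed by the caller
}
```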
You have deserialization. How about serialization?