
Serialization #27

Open · KOLANICH opened this issue Sep 13, 2016 · 88 comments
@KOLANICH commented Sep 13, 2016

You have deserialization. How about serialization?

GreyCat self-assigned this Sep 13, 2016
@GreyCat (Member) commented Sep 13, 2016

We've been slowly discussing this issue, but it's obviously not as easy as it looks. It will probably be a major feature for something like the second major version (v2.0 or so), so it'll eventually be there, but don't hold your breath for it.

@KOLANICH (Author) commented Sep 13, 2016

> We've been slowly discussing this issue, but it's obviously not as easy as it looks.

What are the difficulties?

@GreyCat (Member) commented Sep 13, 2016

It is pretty easy for simple fixed (C-style) structures. However, as soon as you start using instances that bind certain values to offsets in the stream, it becomes much more complex. A very simple example:

seq:
  - id: num_files
    type: u4
  - id: files
    type: file
    repeat: expr
    repeat-expr: num_files
types:
  file:
    seq:
      - id: file_ofs
        type: u4
      - id: file_size
        type: u4
    instances:
      body:
        pos: file_ofs
        size: file_size

This .ksy describes a very simple file index structure which consists of num_files (file_ofs, file_size) pairs. Each pair describes a "file", which can be accessed by index using something like:

file_contents = archive.files[42].body

However, adding a file to such an archive is a challenge. Ideally, one would want to do something like:

archive.files[42].body = file_contents

This should automatically set the bound file_size to accommodate the length of the assigned file_contents and, what's much more complex, assign file_ofs somehow. This is not an easy task: KS has no innate knowledge of how to manage unmapped space in the stream: whether it is limited or not, whether you should find some unused spot and reuse it or just expand the stream and effectively append file_contents to its end, etc.

What's even harder is that in many cases (e.g. archive files, file systems, etc.) you don't want to rewrite the whole file, but just make some changes (e.g. appending a new file to the end of the archive, reusing some pre-reserved padding, or something like that).

Another, maybe even simpler example: when you read PNGs, you don't care about checksums. When you write PNGs, you have to generate proper checksums for every block — thus we need block checksumming algorithms and some way to bind them to blocks.
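For instance, a PNG chunk's CRC-32 covers the chunk type and data, so a writer must recompute it whenever the data changes. A minimal standalone Java sketch of that computation (an editorial illustration using java.util.zip.CRC32, not Kaitai-generated code):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class PngChunkCrc {
        // Per the PNG spec, a chunk's CRC covers the chunk type and chunk data
        // (but not the length field).
        static long chunkCrc(String type, byte[] data) {
            CRC32 crc = new CRC32();
            crc.update(type.getBytes(StandardCharsets.US_ASCII));
            crc.update(data);
            return crc.getValue();
        }

        public static void main(String[] args) {
            // IEND has empty data; its CRC is the well-known constant ae426082.
            System.out.printf("IEND CRC = %08x%n", chunkCrc("IEND", new byte[0]));
        }
    }

A serialization-capable KS would need some way to express this binding in the .ksy itself, so that writing recomputes the checksum field instead of trusting a stale user-supplied value.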

@KOLANICH (Author) commented Sep 13, 2016

What's the problem with deserializing them, editing, serializing back, and then writing?

@GreyCat (Member) commented Sep 13, 2016

What exactly do you refer to as "them"? The file archive example? PNG checksums?

@KOLANICH (Author) commented Sep 13, 2016

> What exactly do you refer to as "them"? The file archive example? PNG checksums?

Almost anything, including files and checksums. In your example,

repeat-expr: num_files

and

  - id: num_files
    type: u4

mean that "files has length num_files". When you deserialize, you read num_files, create a files array of length num_files, and read num_files structures into it. When you serialize, you need the inverse: update num_files with the length, then write. What if num_files is a complex expression we cannot invert automatically? We shift responsibility to the user: to be able to serialize, they must provide both forward and inverse mappings for such expressions, except in simple cases where they can be derived automatically.

Now, some ideas on how to implement this. When processing a KS file, the compiler should build a graph of object dependencies; in your case, num_files <-> files[].length. Then we apply rules; here "scalar <-> array.length" becomes "scalar <- array.length", which results in num_files <- files[].length. We can say that every field has some absolute and actual strength, that edges can go only in the direction of non-increasing actual strength, and that the actual strength of a node is max(max(actual strength of neighbours), absolute strength). This way we transform the graph into a tree and reduce the number of free variables in the case of equal strengths. When you need to serialize, you process the description member by member, launch lazy evaluation according to the graph, and write the members out.

If you want to minimize the number of write operations, you store both versions in memory, create a diff, and try not to touch parts not touched by the diff.

@GreyCat (Member) commented Sep 13, 2016

What if "num_files" is complex expression we cannot derive automatically?

Exactly my thoughts. And actually even that requires us to create some sort of inverse derivation engine. For example, if we have a binding:

    - id: body_size
      type: u4
    - id: body
      size: body_size * 4 + 2

and we update body, we should update body_size, assigning (body.size() - 2) / 4 to it. If there are some irreversible bindings (e.g. modulo, hashes, etc.), then, at the very least, we need to detect that situation and allow some extra syntax to make it possible for the user to set these inverse dependencies manually.
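A minimal standalone sketch of what such an inverse assignment amounts to (hypothetical hand-written Java, not actual generated code; names and stream handling are invented for illustration):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class InverseBindingSketch {
        // Writer for the binding `size: body_size * 4 + 2`: body_size is
        // derived from body via the inverse expression before writing.
        static byte[] write(byte[] body) throws IOException {
            int bodySize = (body.length - 2) / 4; // inverse of body_size * 4 + 2
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                    .putInt(bodySize).array());   // body_size as u4le
            out.write(body);                      // body itself
            return out.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            byte[] body = new byte[4 * 3 + 2];      // length must satisfy 4*k + 2
            System.out.println(write(body).length); // 4 + 14 = 18
        }
    }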

> We can say that every field has some absolute and actual strength, that edges can go only in the direction of non-increasing actual strength, and that the actual strength of a node is max(max(actual strength of neighbours), absolute strength).

I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?

> If you want to minimize the number of write operations, you store both versions in memory, create a diff, and try not to touch parts not touched by the diff.

The point is that we need to add some extra syntax (or API, or something) to make it possible to do custom space allocation or similar things. For example, if you're editing a file system, you can't just always append stuff to the end of it: a block device usually has a finite capacity, and sooner or later you'll exhaust it, and your only choice will be to pick and reuse some free blocks in the middle of the block device.

You seem to have good ideas on implementation. Would you like to join and help implement it? Basically anything will help: .ksy syntax ideas, API ideas, tests for serialization, compiler ideas & code, etc.

@KOLANICH (Author) commented Sep 13, 2016

> And actually even that requires us to create some sort of inverse derivation engine.

Or take some library. Symbolic evaluation is a rather well-studied area of math with lots of papers written on it. I haven't studied it, so it is possible that some of the ideas I've mentioned were already discussed there.

> I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?

Suppose we have an array and a size_t scalar: the number of elements in the array. Which of the two is primary? What defines the pair entirely? The data in the array does; the number is needed only to let us read the array correctly. For example, with C strings you don't need a number, because the convention is to terminate with \0. Add more data to the array and you'll have to increase the array capacity, which means you will have to increase the number in order to read it (and everything after it) correctly. So let array.length have strength 2, and the scalar strength 1. Then you have the link files[].length <-> num_files. Suppose we also have another array filenames[] with capacity num_files - 2, plus foo and bar fields holding something. Connections: filenames[].length <-> num_files, num_files <-> foo, bar <-> foo and files[].length <-> foo.

Now we start processing. Strengths are shown in brackets, absolute first.
1. files[].length (2,0)
2. files[].length (2,2): the node itself is processed, go to its edges
3. files[].length (2,2) <-> num_files (1,0)
4. files[].length (2,2) <-> num_files (1,1)
   2 > 1, so remove the reverse edge and set the actual strength to 2
5. files[].length (2,2) -> num_files (1,2)
6. files[].length (2,2) <-> foo (1,0)
7. files[].length (2,2) <-> foo (1,1)
8. files[].length (2,2) -> foo (1,1)
9. files[].length (2,2) -> foo (1,2)
10. num_files (1,2) <-> filenames[].length (2,0)
11. num_files (1,2) <-> filenames[].length (2,2)
    equal strengths, so both edge directions are kept
12. num_files (1,2) <-> foo (1,2)
13. foo (1,2) <-> bar (1,0)
14. foo (1,2) <-> bar (1,1)
15. foo (1,2) -> bar (1,1)
16. foo (1,2) -> bar (1,2)

Then, when we need to serialize, we follow this configuration:
1. num_files goes first; its only incoming edge is from files[].length.
2. files[].length is already evaluated: take its value and evaluate num_files.
3. num_files has 2 bidirectional edges, filenames[].length and foo; evaluate them the same way and check that they match the value of num_files.
4. serialize and write
5. continue to the rest of the fields.

> explain why it is helpful, i.e. which problem we're trying to solve here?

It determines the order in which we should evaluate expressions and which expressions depend on which, and it helps to find conflicts.
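A toy sketch of this edge-orientation rule (an illustrative editorial reading, not project code; the propagation of actual strengths is omitted for brevity):

    import java.util.List;
    import java.util.Map;

    public class StrengthGraphSketch {
        public static void main(String[] args) {
            // Absolute strengths: arrays (2) define the data, scalars (1) merely describe it.
            Map<String, Integer> strength = Map.of(
                "files[].length", 2,
                "filenames[].length", 2,
                "num_files", 1,
                "foo", 1,
                "bar", 1);
            // Undirected dependency edges from the example above.
            List<String[]> edges = List.of(
                new String[]{"files[].length", "num_files"},
                new String[]{"filenames[].length", "num_files"},
                new String[]{"num_files", "foo"},
                new String[]{"foo", "bar"},
                new String[]{"files[].length", "foo"});
            // Orient each edge from the stronger node to the weaker one;
            // equal strengths keep both directions (checked at serialization time).
            for (String[] e : edges) {
                int a = strength.get(e[0]), b = strength.get(e[1]);
                if (a > b)      System.out.println(e[0] + " -> " + e[1]);
                else if (a < b) System.out.println(e[1] + " -> " + e[0]);
                else            System.out.println(e[0] + " <-> " + e[1] + " (consistency check)");
            }
        }
    }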

And I think you should really read something about symbolic evaluation (I haven't; the ideas above are just ad-hoc thoughts, and maybe there are better approaches).

> You seem to have good ideas on implementation. Would you like to join and help implement it?

Sorry, no. Maybe I'll send you some ideas or code later, but I can't be a permanent member of this project.

> Basically anything will help: .ksy syntax ideas, API ideas, tests for serialization, compiler ideas & code, etc.

OK.

@GreyCat (Member) commented Sep 13, 2016

> Add more data to the array and you'll have to increase the array capacity, which means you will have to increase the number in order to read it (and everything after it) correctly. So let array.length have strength 1, and the scalar 0.

Ok, then what shall we do in case of the following:

seq:
  - id: num_objs
    type: u4
  - id: headers
    type: header
    repeat: expr
    repeat-expr: num_objs
  - id: footers
    type: footer
    repeat: expr
    repeat-expr: num_objs

This implies that headers[] and footers[] shall always have the same number of objects. How are the strengths assigned in this case, and how do we enforce that the two arrays have an equal number of objects?

@KOLANICH (Author) commented Sep 13, 2016

1. see the example above
2. by throwing an exception, of course

PS: graph evaluation is done by the KS compiler; runtime checks are done by the generated code.

Also, I fixed the priorities above to match the trace lines, supposing the strengths were 1 and 2.

@GreyCat (Member) commented Mar 24, 2017

I've committed very basic PoC code that demonstrates seq serialization in Java. It is available in separate "serialization" branches (of the compiler and the Java runtime), which one will need to check out to test it.

Obviously, only a few tests were converted, and, to be frank, only a very basic set of types is supported right now. Even strings are not implemented yet. Testing is very basic too: one can run ./run-java and see that two packages are run; spec holds the "normal" reading tests, and specwrite holds the writing tests.

I'd really love to hear any opinions on the API (both runtime & generated) and the Java implementation (it's a real pain, as ByteBuffer does not grow, so you have to preallocate the array, or probably reimplement everything twice with something that grows), etc.

@GreyCat (Member) commented Mar 25, 2017

Strings work, repetitions work, enums work, things are slowly coming to reality :)

@GreyCat (Member) commented Mar 27, 2017

Serialization progresses slowly. Basic in-stream user types and processing on byte types work; various fancy byte-type stuff like terminator and pad-right works too.

I've got an idea that might be very simple to implement.

Step 1: manual checks

Generate _read, _write and _check methods. _check runs all the internal format consistency checks to ensure that stuff that will be written will be read back properly. For example:

seq:
  - id: len_of_1
    type: u2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1

would generate:

    public void _read() {
        this.lenOf1 = this._io.readU2le();
        this.str1 = new String(this._io.readBytes(lenOf1() * 2 + 1), Charset.forName("ASCII"));
    }
    public void _write() {
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes(this.str1.getBytes(Charset.forName("ASCII")));
    }
    public void _check() {
        if (this.str1.getBytes(Charset.forName("ASCII")).length != lenOf1() * 2 + 1)
            throw new FormatConsistencyError("str1 size", this.str1.getBytes(Charset.forName("ASCII")).length, lenOf1() * 2 + 1);
    }

To use this properly, one must manually set both lenOf1 and str1:

r.setStr1("abcde");
r.setLenOf1(2);
r._check(); // should pass, so we're clean to take off
r._write(); // should write consistent data that's guaranteed to be readable back

Step 2: dependent variables

We declare some fields as "dependent" and mark them up in the .ksy:

seq:
  - id: len_of_1
    type: u2
    dependent: (str1.to_b("ASCII").size - 1) / 2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1

This means that len_of_1 becomes a read-only variable; the setLenOf1 setter won't be generated. Instead, slightly different _write code would be generated:

    public void _write() {
        this.lenOf1 = (str1().getBytes(Charset.forName("ASCII")).length - 1) / 2;
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes(this.str1.getBytes(Charset.forName("ASCII")));
    }

Obviously, using this boils down to a single r.setStr1("abcde");.

Any comments / ideas / etc?

@KOLANICH (Author) commented Mar 28, 2017

> Any comments / ideas / etc?

  1. Move the evaluation of the expression into a separate method.
  2. Use _check and _write in a generic form, without explicit expressions in them, calling the method from 1).
  3. Why not use value instead of dependent?
  4. Why do we use .to_b("ASCII").size? The string encoding is known, so why not just .size?
  5. These dependent keys are ugly; it'd be nice to eliminate them, but we need a decision on what kinds of expressions should be resolved automatically. I guess linear ones should be enough. Another question is how to solve them. There is exp4j for parsing and storage, but it'll require some code to build a simple symbolic Gaussian-elimination solver on top of it. If we want to write it in Python, here are the docs for a lib wrapping multiple SMT solvers: https://github.com/angr/angr-doc/blob/master/docs/claripy.md

@GreyCat (Member) commented Mar 28, 2017

> • Move the evaluation of the expression into a separate method.
> • Use _check and _write in a generic form, without explicit expressions in them, calling the method from 1).

You mean, something like this?

    public void _read() {
        this.lenOf1 = this._io.readU2le();
        this.str1 = new String(this._io.readBytes(_sizeStr1()), Charset.forName("ASCII"));
    }
    public void _write() {
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes(this.str1.getBytes(Charset.forName("ASCII")));
    }
    public void _check() {
        if (this.str1.getBytes(Charset.forName("ASCII")).length != _sizeStr1())
            throw new FormatConsistencyError("str1 size", this.str1.getBytes(Charset.forName("ASCII")).length, _sizeStr1());
    }
    public int _sizeStr1() {
        return lenOf1() * 2 + 1;
    }

Does it bring any benefits? It won't really simplify generation (probably the contrary), and we'll need to invent tons of names for all those size, if, process, etc. expressions.

> • Why not use value instead of dependent?

Naming is, of course, still to be discussed. One argument I have against value is that value is already used for reading in value instances.

> Why do we use .to_b("ASCII").size? The string encoding is known, so why not just .size?

Using size on a string will give the length of the string in characters. If you put some non-ASCII characters into that string, a proper .to_b("ASCII").size conversion will give you an exception, while just assuming "number of bytes = number of characters" will give you corrupted data.

I guess we could try to do some sort of .bytesize method for strings taken verbatim from the format definition, where "the encoding is known", to save retyping the encoding name. However, it still won't work on modified strings: e.g. it's possible to implement str1.bytesize, but it won't work for (str1 + 'x').bytesize (as the latter is a CalcStrType, which lacks any source encoding info by design).
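A quick standalone Java illustration of the character-count vs. byte-count mismatch (editorial example, not Kaitai code):

    import java.nio.charset.StandardCharsets;

    public class CharsVsBytes {
        public static void main(String[] args) {
            String s = "héllo";                                             // 5 characters
            System.out.println(s.length());                                 // 5 (UTF-16 code units)
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 6 ('é' is 2 bytes in UTF-8)
            // Note: String.getBytes() silently replaces unmappable characters
            // (e.g. 'é' in US-ASCII becomes '?'); a strict CharsetEncoder
            // configured with CodingErrorAction.REPORT would throw instead,
            // which is the "exception" behavior described above.
        }
    }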

> • These dependent keys are ugly; it'd be nice to eliminate them

First of all, you can't really eliminate them completely in every case. Some functions are just irreversible, and in some cases you'll have more free variables than constraints. For example:

seq:
  - id: a
    type: u1
  - id: b
    type: u1
  - id: my_str
    type: str
    size: a + b

Even if you have the byte size of my_str, you can't set both a and b automatically. Reversing stuff automatically would be more of a syntactic-sugar feature, just to save typing boring stuff where possible. In fact, I heavily suspect that we'll cover 95% of cases with very crude logic like size: a => a = attr.size.

@KOLANICH (Author) commented Mar 28, 2017

> value is already used for reading in value instances

Yes, that's why I've chosen it. It has the same semantics: deriving the value of a struct member from other ones makes it a kind of instance, just one tied to an offset.

> we could try to do some sort of .bytesize

The size attribute in a .ksy means size in bytes, so I see no reason for it to mean anything else in the expression language.

> First of all, you can't really eliminate them completely in every case.

Yes, we have already discussed this.

> Even if you have the byte size of my_str, you can't set both a and b automatically.

That's because they are of equal strength. This should cause ksc to emit a warning about undefined behavior.

> Reversing stuff automatically would be more of a syntactic-sugar feature, just to save typing boring stuff where possible.

Not only that. One expression can contradict another, which can cause nasty errors or be used as a backdoor. The other side is that we (humans) don't know the exact expressions ahead of time. So I propose the following:
1. have a syntax for providing a manual expression;
2. missing expressions are generated by the compiler and inserted into the .ksy as another type of expression;
3. a human examines the .ksy output for errors and malicious code;
4. the verified output is used to generate the actual code;
5. we would sometimes want to:

  • regenerate all non-manual expressions in a .ksy
  • check that expressions don't contradict each other

> In fact, I heavily suspect that we'll cover 95% of cases with very crude logic like size: a => a = attr.size.

Maybe.

@ixe013 commented Oct 5, 2017

I would like to play with the serialization branch...

Is there a pre-built compiler and Java runtime of the serialization branch available? If not, that's OK, I'll set up a Scala build environment.

@GreyCat (Member) commented Oct 5, 2017

@ixe013 There are no pre-built packages for such experimental branches, but it's relatively straightforward to build it yourself. You don't really need anything beyond sbt — it will download all the required stuff (including the Scala compiler and libraries of the needed versions) automatically. See the dev docs — it's literally one command to run, like sbt compilerJVM/universal:packageBin — et voilà.

@glenn-teillet commented:

I was able to build a read-write version of my .ksy; however, the generated code does not compile, as the reader reads Ints and stores them in enums as Longs (Int -> Long), while the _write() methods try to write the enum's Long as an Int (Long -> Int causes an error).

I have these errors:
The method writeBitsInt(int, long) is undefined for the type KaitaiStream
The method writeU1(int) in the type KaitaiStream is not applicable for the arguments (long)
The method writeU2be(int) in the type KaitaiStream is not applicable for the arguments (long)

@glenn-teillet commented:

I also have errors in every constructor, where it tries to assign the _parent received as a KaitaiStruct in the constructor signature to the _parent field without casting to the more specific type.

Type mismatch: cannot convert from KaitaiStruct to KaitaiStruct.ReadWrite

@GreyCat (Member) commented Oct 10, 2017

@glenn-teillet There's still lots of work to do; it's nowhere near production-ready. Before trying to compile real-life .ksy files with lots of complicated details, we should probably spend some time getting the read-write test infrastructure to work and porting that older serialization branch code to the modern codebase.

@KOLANICH (Author) commented Oct 23, 2017

Some suggested definitions for the serialization spec:

  • Serialization.
    Suppose we have: a binary format f; a set FS of sequences of bits; its subset FS_f of byte sequences forming a valid instance of the format; a set PL of object-oriented, Turing-complete programming languages; a set KSY of valid Kaitai Struct definitions, including the subset KSY_f of definitions for the format f; and the KS compiler KSC : PL × KSY → (PSC, SSC), where PSC : FS → O is a set of parsing programs and SSC : O → FS is a set of serializing programs, such that ssc_{ksy_f}(psc_{ksy_f}(s)) ≡ s for all s ∈ FS_f, all ksy_f ∈ KSY_f and all pl ∈ PL, where KSC(pl, ksy_f) = (psc_{ksy_f}, ssc_{ksy_f}) (see the LaTeX restatement after this list). To be practically usable, there should also be a way to create an o = psc_{ksy_f}(s) programmatically, without parsing any actual bit string s.

Serialization is the part of KSC producing serialization programs.

  • The internal representation is the set of objects created by KSC-generated code in the program's runtime.
  • Finalization is the process of transforming the internal representation so that only trivial transformations remain to be done to produce the serialized structure.
  • An expression is a mapping from a subset of internal-representation variables, called the expression's arguments, to another variable, called the expression's output.
  • A reverse expression is an expression mapping an original expression's output back to its arguments, with respect to the current state of the internal representation (including the arguments).
  • Trivial transformations are the ones not involving computing any expressions. Examples of trivial transformations are endianness conversions and bit shifts induced by using bit-sized fields.
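The round-trip requirement from the first bullet, restated in LaTeX with the same symbols:

    \forall pl \in PL,\;
    \forall ksy_f \in KSY_f,\;
    \forall s \in FS_f:\quad
    KSC(pl, ksy_f) = (psc_{ksy_f},\, ssc_{ksy_f})
    \;\Longrightarrow\;
    ssc_{ksy_f}\bigl(psc_{ksy_f}(s)\bigr) = s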

Comments?

@FSharpCSharp commented:

This is a very interesting subject. What is the current status of this? Will it be pursued in the future? It would be very good if one could use serialization; especially for bit-based binary formats it would be a great relief. I hope the issue will be pursued further and that we can benefit from it in the future.

@GreyCat (Member) commented Dec 22, 2017

No updates here; the latest comments like #27 (comment) still reflect the current state of affairs. Help is welcome :)

krisutofu pushed a commit to krisutofu/kaitai_struct that referenced this issue Jan 2, 2022
Typo fix: "slighly" --> "slightly"
@DarkShadow44 commented:

Are there any plans to merge the initial serialization support anytime soon? It's tough to base future work on such an old parallel branch.
For my C port, once that's done, I could port over my serialization concept from my old Python-based compiler, but I'm not sure that would be accepted. Opinions?

@jchv commented May 6, 2022

It’d be really nice to see any motion on this. I think even having pretty broken serialization would be good for a start…

@poelzi commented Jun 1, 2022

I seriously can't understand why this issue is still open after more than 5 years. If you have any support at all, you get small PRs that improve the behaviour because people need certain features. Expecting perfect support from the beginning is unrealistic, but there is still no support at all.
We will either have to switch to another solution or give some other compiler a chance.
If somebody asked me what to use, I would say they should stay away from this and use something with write support already implemented. I really like the idea and the format is nice, but yeah. The world is not read-only.

@jchv commented Jun 2, 2022

The problem is simple: while some work has been done, I don't think anyone is actively working on this. It's worth understanding that a huge use case for Kaitai is dissecting data and reverse engineering, where you indeed would not care about serialization. It'd be nice to get things moving, but even if some partial implementation would be acceptable, nobody is pushing one forward at the moment. So I guess that would be the first problem to address, if any.

Open source isn't really a competition, though. If you happen to know of a good alternative that fills this niche, it would be nice to know. On my end, I have a Go library that sort of overlaps, but aside from being limited to Go, it, too, could use some more maintenance…

@Mimickal commented Jun 2, 2022

Kaitai's compiler does have a target for Python's Construct library, which supports both serialization and deserialization. I have several open PRs for that compiler target to bring it up to parity with the other targets. There are some KSY mechanisms it doesn't quite support yet, but for basic use cases it works.

Of course, that does also require using Python.

@generalmimon (Member) commented Aug 9, 2022

I've been taking notes on various things about the implementation of serialization; in the event that someone wants to read them, here they are: https://gist.github.com/generalmimon/fc22e97faf1fe4b4edc8279b0caa152d

@DarkShadow44 commented Dec 19, 2022

Are you sure we want things like mixing read/write? What I'd target, at least as a first version, is to just have "write to file" functionality. It might not be perfectly efficient, but it's a lot simpler.
That seems to be a lot easier than what you're going for: just fill the classes and then give the main class the instruction to write itself to a file/stream.

Also, I'm not sure if I missed this in your document, but what about structures that are pointed to from different locations? That's why I originally opted for two-pass writing, so you get a chance to update the data with real offsets from the dry run before it's actually written.

@generalmimon (Member) commented:

@DarkShadow44:

> Are you sure we want things like mixing read/write?

I don't know what you're referring to. Can you be more specific and perhaps quote some parts of the document to ensure that we're on the same page?

> That seems to be a lot easier than what you're going for: just fill the classes and then give the main class the instruction to write itself to a file/stream.

What do you think I am going for, after reading the document? "Fill the classes and then give the main class the instruction to write itself to a file/stream" sounds exactly like how I intend the resulting serialization-capable classes to be used. I'm just discussing what happens when you give the main class that instruction and exactly how it's done under the hood.

> What about structures that are pointed to from different locations?

Once again, I have no clue what you mean; this can mean so many things. Is "pointed to" like when you have a positional instance (instances + pos) that parses a structure located at a particular byte offset in the file? Does "from different locations" mean that the same block of bytes at that byte position in the input binary file is parsed from multiple instances, i.e. "locations", in the .ksy specification?

@DarkShadow44 commented Dec 19, 2022

> Are you sure we want things like mixing read/write?

> I don't know what you're referring to. Can you be more specific and perhaps quote some parts of the document to ensure that we're on the same page?

> so if you would mix writes and reads within one byte, e.g. writeBitsInt(3) + readBitsInt(5), the readBitsInt will misinterpret what writeBitsInt has just set to the bits and bitsLeft properties, thus it would just do some undefined behavior

Guess I missed this later part:

> Besides that, unlike the bXbe - bXle situation which can occur in a real .ksy specification, consecutive mixed {write,read}BitsInt* calls should not happen (once everything is correctly implemented in the compiler).

Ignore that, please

> That seems to be a lot easier than what you're going for: just fill the classes and then give the main class the instruction to write itself to a file/stream.

> What do you think I am going for, after reading the document? "Fill the classes and then give the main class the instruction to write itself to a file/stream" sounds exactly like how I intend the resulting serialization-capable classes to be used. I'm just discussing what happens when you give the main class that instruction and exactly how it's done under the hood.

Seeing

        BufferedStruct r = new BufferedStruct() {{
            len1(0x10);
            block1(new BufferedStruct.Block() {{
                number1(0x42);
                number2(0x43);
            }});

            len2(0x8);
            block2(new BufferedStruct.Block() {{
                number1(0x44);
                number2(0x45);
            }});

            finisher(0xee);
        }};

it looked like everything is written as soon as the functions are called.

> What about structures that are pointed to from different locations?

> Once again, I have no clue what you mean; this can mean so many things. Is "pointed to" like when you have a positional instance (instances + pos) that parses a structure located at a particular byte offset in the file? Does "from different locations" mean that the same block of bytes at that byte position in the input binary file is parsed from multiple instances, i.e. "locations", in the .ksy specification?

Take the following dummy file:

meta:
  id: test
  endian: le
seq:
  - id: chunks
    type: chunk
    repeat: expr
    repeat-expr: 4
  - id: pos_chunk_2
    type: u2
types:
  chunk:
    seq:
      - id: len
        type: u1
      - id: data
        size: len

Now assume pos_chunk_2 has to be the absolute address (from the beginning of the file) where the second chunk starts. For reading this is no problem, but for writing we need to set the value correctly. In this case it's simple, but it can get a lot more complicated. I currently don't see how we can do that using a single writing pass.
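For this particular layout, the offset could also be computed in a pre-pass over the in-memory objects before anything is written. A standalone editorial sketch (hypothetical Chunk class, not Kaitai-generated code):

    public class OffsetPrePass {
        // Hypothetical in-memory model of a `chunk` from the format above;
        // `len` is derivable as data.length.
        static class Chunk {
            byte[] data;
        }

        // Absolute offset (from the start of the file) of chunks[index]: each
        // preceding chunk occupies 1 byte (len, u1) plus data.length bytes.
        static int offsetOfChunk(Chunk[] chunks, int index) {
            int ofs = 0;
            for (int i = 0; i < index; i++)
                ofs += 1 + chunks[i].data.length;
            return ofs;
        }

        public static void main(String[] args) {
            Chunk[] chunks = new Chunk[4];
            for (int i = 0; i < chunks.length; i++) {
                chunks[i] = new Chunk();
                chunks[i].data = new byte[i + 1];
            }
            int posChunk2 = offsetOfChunk(chunks, 1); // second chunk
            System.out.println(posChunk2);            // 2 = 1 (len) + 1 (data of chunk 0)
        }
    }

The hard part, as the following comments discuss, is deciding who is responsible for running such a pre-pass: the generated code, the user, or a constraint solver.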

@KOLANICH (Author) commented:

> Now assume pos_chunk_2 has to be the absolute address (from the beginning of the file) where the second chunk starts.

pos_chunk_2 is never used, so it can be arbitrary. But you have spotted the right problem: what is the boundary at which we should stop building the system of equations?

I guess:

  • If pos_chunk_2 were used in this spec, setting it right would be the responsibility of the code generated from this spec. If this spec were used by some external spec, and pos_chunk_2 were used by that spec as a pos, it'd be the responsibility of that spec to set it correctly. And sometimes it may be necessary to allow a solver to treat imported specs as white boxes; so, I guess, we need a hint about that.

Another issue: we sometimes use pos-instances whose byte ranges intersect with other entities. That is sometimes the kinda-officially-supported way to implement switchable endianness. We need to handle this situation somehow.

@DarkShadow44 commented:

> If pos_chunk_2 were used in this spec, setting it right would be the responsibility of the code generated from this spec.

Either that, or give the user the ability to update variables with position info before they're written: first calculate all target positions, then let the user update the structure again. That seems easier than implementing a hint system, anyway.

@generalmimon (Member) commented Dec 20, 2022

@DarkShadow44:

> Seeing

        BufferedStruct r = new BufferedStruct() {{
            len1(0x10);
            block1(new BufferedStruct.Block() {{
                number1(0x42);
                number2(0x43);
            }});
// ...

> it looked like everything is written as soon as the functions are called.

Sorry, this was really just an incomplete snippet in a rant about whether or not to keep the set prefix for setters. That incomplete snippet would just create an empty BufferedStruct object and fill it with values. To actually serialize something, you need to call the _write() method (and if the _io property isn't set to a valid writable stream, the _write(io) overload has to be called instead).

If you want to look at a somewhat more complex use case of the serialization API, see the test TestExpr2.java:24. It shows an example of reading a file, editing the values (some changed fields are actually dependencies of value instances, which must be manually invalidated) and writing it to an in-memory stream. The manual calculation of the byte size for the newIo stream to write to is also accurate, as explained later.

> Now assume pos_chunk_2 has to be the absolute address (from the beginning of the file) where the second chunk starts. For reading this is no problem, but for writing we need to set the value correctly. In this case it's simple, but it can get a lot more complicated. I currently don't see how we can do that using a single writing pass.

My design has a simple answer to all sizes and offsets: calculated and set by the user, checked by KS-generated code (well, if EOF errors count as checking, but I guess they do; larger-than-needed sizes are always valid). Also, all streams have a fixed size (including the root stream), i.e. there is no "growable" stream.

This is very likely to change in the future to make the serialization process more user-friendly, but I would insist on keeping it for the early version of serialization, because it makes many things simpler (from the perspective of the KS compiler, not the user, of course). There are quite a few problematic things you wouldn't even expect, and I have already dealt with many of them.

You'd think that having to set sizes of everything as a user is unmanageable, but it's actually easy - you do it gradually just for the direct fields of the type you're currently processing, and delegate lengths that depend on child types to their respective methods that fill each one of them:

    private void Build()
    {
        var cdr = new CoreldrawCdr(new List<KaitaiStream>(), null);
        cdr.RiffChunk = new CoreldrawCdr.RiffChunkType(null, cdr, cdr.M_Root);
        var lenFile = Fill(cdr.RiffChunk);
    }

    private uint Fill(CoreldrawCdr.RiffChunkType chunk)
    {
        chunk.ChunkId = new[] { (byte)'R', (byte)'I', (byte)'F', (byte)'F' };
        chunk.Body = new CoreldrawCdr.CdrChunkData(null, chunk, chunk.M_Root);
        chunk.LenBody = Fill(chunk.Body);
        chunk.PadByte = GetPadByte(chunk.LenBody);
        return
            (uint)chunk.ChunkId.Length + // chunk_id
            4 + // len_body
            chunk.LenBody + // body
            (uint)chunk.PadByte.Length // pad_byte
        ;
    }

    private uint Fill(CoreldrawCdr.CdrChunkData chunkData)
    {
        // ...

You can also use this to calculate offsets, reserve space for instances, and everything else. It's always just adding numbers together, and rarely some multiplication. Very easy.

So I don't have any concerns about it being particularly limiting. Lots of people in this thread seemed like they would be happy with any simple, stupid solution, and I'm here to provide it.


@KOLANICH:

> Another issue: we sometimes use pos-instances whose byte ranges intersect with other entities. That is sometimes the kinda-officially-supported way to implement switchable endianness. We need to handle this situation somehow.

I've already thought of that and resolved it: in each object, you can individually choose which positional instances get written. There is a special setter for this (this line is from InstanceStd.java generated from the test format instance_std.ksy):

    public void setHeader_ToWrite(boolean _v) { _toWriteHeader = _v; }

I didn't have a better name for it, but it means "whether to write the header instance or not". The private _toWriteHeader field is true by default; if you call setHeader_ToWrite(false), the header instance will be ignored (i.e. not written) when you call _write().
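A hypothetical usage sketch (assuming the ReadWrite-style API from the serialization branch; the constructor, overloads, and OUTPUT_LEN are illustrative assumptions):

    // Read an existing file, then rewrite it without the `header` pos-instance.
    InstanceStd r = new InstanceStd(new ByteBufferKaitaiStream("instance_std.bin"));
    r._read();
    r.setHeader_ToWrite(false); // suppress writing of `header`
    r._write(new ByteBufferKaitaiStream(new byte[OUTPUT_LEN])); // OUTPUT_LEN: precomputed size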

I encourage you to explore the serialization branches: everything around instances is implemented, even in the context of substreams and the io key, which was quite tricky to get right.

I'm currently working on substreams with process, they should be coming soon.

@DarkShadow44 commented:

> So I don't have any concerns about it being particularly limiting. Lots of people in this thread seemed like they would be happy with any simple, stupid solution, and I'm here to provide it.

Yeah, me too! Thanks for your work and the explanation. IMHO it will get a lot easier once there is a solid foundation!

@generalmimon (Member) commented:

Serialization for Java is done. I started a documentation page for how to use it: https://doc.kaitai.io/serialization.html

@GreyCat (Member) commented Jul 27, 2023

Let me break out a separate discussion with a checklist of what's needed to merge serialization into master: #1060

@JocelynLoft commented:

Thanks for all the hard work! I'm super interested in getting an implementation for C++. Is there an existing effort toward that goal, or anything I could help with?

@hu-chia commented Jan 16, 2024

I'm a little frustrated using the serialization for Java. Everything looked great at first, but when I added a CustomProcessor for encryption and decryption, I got many BufferOverflowExceptions.

The generated code creates a lot of ByteBufferKaitaiStream instances of length x_InnerSize, but the byte array returned by the encryption logic is longer than the original bytes, so the WriteBackHandler threw a BufferOverflowException. I guess I must set the correct x_InnerSize before writing the ReadWrite object to the KaitaiStream. Is that possible?

What is the correct way to set x_InnerSize?

For now, I'm implementing a KaitaiStream that grows itself when its capacity is insufficient, but I have to manually edit the generated code to replace all the ByteBufferKaitaiStream instances. Is there a better way?

@hu-chia commented Jan 17, 2024

Another problem: the generated Java code behaves as follows:

        public void _write_Seq() {
            this._io.writeS4be(lenPayload());
            Compressor _process_payload = new Compressor();
            this._raw_payload = _process_payload.encode(payload());
            // It's basically impossible to predict the length of the data and set lenPayload correctly before performing compression.
            if ((_raw_payload().length != lenPayload()))
                throw new ConsistencyError("payload", _raw_payload().length, lenPayload());
            this._io.writeBytes(_raw_payload());
        }

@generalmimon Could you please give me some advice?

@generalmimon (Member) commented:

@hu-chia First of all, thanks for your feedback!

> but the byte array returned by the encryption logic is longer than the original bytes, so the WriteBackHandler threw a BufferOverflowException. I guess I must set the correct x_InnerSize before writing the ReadWrite object to the KaitaiStream.

Your observations are correct. As I've already hinted in some of the previous comments, the serialization feature deals only with serialization itself so far: it expects a fully populated object tree, including lengths, offsets, CRC-32 checksums and whatnot, all of which must be supplied by the user in advance. No auto-deriving of values of auxiliary fields takes place yet [1]. Also, as you correctly note in your second comment, once you call _write(), the writing process is non-interactive: it does not stop at any point to give you the opportunity to supply more input.

To your specific questions:

> What is the correct way to set x_InnerSize?

The *_InnerSize field is actually the size of the inner substream, which means it is independent of the compression algorithm used. It only depends on the data before performing the compression, which you should be able to determine without problems, just by summing the sizes of the fields you have in the substream.

            this._raw_payload = _process_payload.encode(payload());
            // It's basically impossible to predict the length of the data and set lenPayload correctly before performing compression.
            if ((_raw_payload().length != lenPayload()))
                throw new ConsistencyError("payload", _raw_payload().length, lenPayload());
            this._io.writeBytes(_raw_payload());

Yes, this is a bit tricky. Basically, you have to make a test write of the contents that will be compressed, compress it, and see how many bytes you get. Let me demonstrate the suggested solution with an example:

meta:
  id: zlib_serialization
  endian: le
seq:
  - id: len_body
    type: u4
  - id: body
    size: len_body
    type: contents
    process: zlib
types:
  contents:
    seq:
      - id: foo
        type: s4
      - id: bar
        type: f4

One feasible way to serialize this is as follows (using the same pattern as in #27 (comment)):

    private static byte[] build() throws IOException {
        ZlibSerialization r = new ZlibSerialization();
        r.body = new ZlibSerialization.Contents(null, r, r._root());
        r.setBody_InnerSize(fill(r.body));

        // test write
        byte[] bodyBuf = new byte[r.body_InnerSize()];
        try (KaitaiStream io = new ByteBufferKaitaiStream(bodyBuf)) {
            r.body._write(io);
        }
        r.lenBody = KaitaiStream.unprocessZlib(bodyBuf).length;
        r._check();

        int lenFile =
            4 + // len_body
            (int)r.lenBody // body
        ;
        byte[] output = new byte[lenFile];
        try (KaitaiStream io = new ByteBufferKaitaiStream(output)) {
            r._write(io);
        }
        return output;
    }
    
    private static int fill(ZlibSerialization.Contents cont) {
        cont.setFoo(-4);
        cont.setBar(1.5f);
        cont._check();
        return
            4 + // foo
            4 // bar
        ;
    }

    private static String byteArrayToHex(byte[] arr) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < arr.length; i++) {
            if (i > 0)
                sb.append(' ');
            sb.append(String.format("%02x", arr[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(byteArrayToHex(build()));
        // output: 10 00 00 00 78 9c fb f3 ff ff 7f 06 86 03 f6 00 1b 95 04 f9
    }

For CRC-32 checksums, for example, we can use the same approach, just instead of checking how many compressed bytes we get, we calculate the CRC-32 checksum and save it in the object tree so that it's available once it comes to the actual serialization.

Of course, this requires the processing routine to be deterministic.

Footnotes

[1] The only exception is stream parameters, because there is indeed no way for the user to set a stream parameter to the "correct" stream, as (sub)streams are created only during the uninterruptible _write() execution.

@hu-chia commented Jan 18, 2024

I guess there would be two compressions here (including the process in the generated code), am I right?

        r.lenBody = KaitaiStream.unprocessZlib(bodyBuf).length;
        r._check();

My solution for now is to pass in the compressed/encrypted data when constructing the object tree, and to provide an empty implementation for the encode method of CustomProcessor. But this is not intuitive.
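A sketch of that workaround (editorial illustration; the CustomProcessor interface shape is assumed from the discussion above, and the class name is invented):

    import io.kaitai.struct.CustomProcessor;

    // Pass-through processor: the caller stores already-encrypted bytes in the
    // object tree, so encode() has nothing left to do.
    public class PreEncryptedProcessor implements CustomProcessor {
        @Override
        public byte[] decode(byte[] src) {
            return src; // real decryption would go here for the reading path
        }

        @Override
        public byte[] encode(byte[] src) {
            return src; // no-op: bytes were encrypted before being set
        }
    }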
