
Allow repeat: until to optionally exclude the last non-matching element (look-ahead feature) #156

Open
davidhicks opened this issue Apr 21, 2017 · 14 comments

Comments

@davidhicks

davidhicks commented Apr 21, 2017

Currently repeat: until will always read and include the non-matching element. It would be preferable if this behaviour could be toggled (with consume: false?) so that it is possible to perform a look-ahead on future bytes in the stream. Because an attempt would need to be made to read the element following the last valid element matching the repeat-until condition, it is possible that errors may be encountered. In such a case the behaviour should be to catch and ignore look-ahead reading errors, and reset the stream position back to the last known element which was successfully read and passed the repeat-until condition.
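The requested semantics can be sketched in plain Python for illustration (the names repeat_until_no_consume and parse_u2be are made up, and this is not generated Kaitai code): record the stream position before each read, and on a read error or a non-matching element, seek back and stop.

```python
import io
import struct

def repeat_until_no_consume(stream, parse_element, matches):
    """Parse elements while matches(elem) holds; on the first element that
    fails the condition (or a failed read), seek back so it is not consumed."""
    elements = []
    while True:
        pos = stream.tell()
        try:
            elem = parse_element(stream)
        except EOFError:
            stream.seek(pos)   # look-ahead read failed: restore position
            break
        if not matches(elem):
            stream.seek(pos)   # condition failed: exclude this element
            break
        elements.append(elem)
    return elements

def parse_u2be(stream):
    buf = stream.read(2)
    if len(buf) < 2:
        raise EOFError
    return struct.unpack(">H", buf)[0]

s = io.BytesIO(bytes([0x00, 0x01, 0x00, 0x02, 0xFF, 0xD8]))
vals = repeat_until_no_consume(s, parse_u2be, lambda v: v != 0xFFD8)
# vals == [1, 2]; the stream is positioned back at the 0xFFD8 marker
```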

@davidhicks davidhicks changed the title Allow repeat: until to optionally exclude the last non-matching element (positive look-ahead feature) Allow repeat: until to optionally exclude the last non-matching element (look-ahead feature) Apr 21, 2017
@davidhicks
Author

davidhicks commented Apr 21, 2017

An example of where this feature could be helpful is in parsing the JPEG Interchange Format (see ITU-T T.81 at https://www.w3.org/Graphics/JPEG/itu-t81.pdf). In the non-hierarchical variant of this format, any number of table or misc segment elements can occur in any order prior to a mandatory frame header element (see figure B.2/page 34 of ITU-T T.81).

I have managed to implement a workaround, demonstrated below (non-essential parts removed):

types:
  frame:
    seq:
      - id: table_or_misc_segments
        type: table_or_misc_segment
        repeat: until
        repeat-until: >-
          _.next_marker != marker_codes::define_quantization_tables and
          _.next_marker != marker_codes::huffman_table_specification and
          _.next_marker != marker_codes::arithmetic_coding_conditioning_specification and
          _.next_marker != marker_codes::define_restart_interval and
          _.next_marker != marker_codes::comment and
          _.next_marker != marker_codes::application_segment_0 and
          _.next_marker != marker_codes::application_segment_1 and
          _.next_marker != marker_codes::application_segment_2 and
          _.next_marker != marker_codes::application_segment_3 and
          _.next_marker != marker_codes::application_segment_4 and
          _.next_marker != marker_codes::application_segment_5 and
          _.next_marker != marker_codes::application_segment_6 and
          _.next_marker != marker_codes::application_segment_7 and
          _.next_marker != marker_codes::application_segment_8 and
          _.next_marker != marker_codes::application_segment_9 and
          _.next_marker != marker_codes::application_segment_10 and
          _.next_marker != marker_codes::application_segment_11 and
          _.next_marker != marker_codes::application_segment_12 and
          _.next_marker != marker_codes::application_segment_13 and
          _.next_marker != marker_codes::application_segment_14 and
          _.next_marker != marker_codes::application_segment_15
    types:
      table_or_misc_segment:
        seq:
          - id: marker
            type: u2
            enum: marker_codes
          - id: length
            type: u2
          - id: data
            size: length - 2
            type:
              switch-on: marker
              cases:
                #...
        instances:
          next_marker:
            pos: _io.pos
            type: u2
            enum: marker_codes

It seems that the next_marker instance is calculated after the table_or_misc_segment sequence has been fully parsed, thus pos: _io.pos returns the position within the file of the next byte beyond the table_or_misc_segment sequence.

Is this workaround suitable, or just a dodgy hack that will break in various translators?

@GreyCat
Member

GreyCat commented Apr 24, 2017

There are several independent issues here.

  1. Is it possible to omit the last read element from the resulting array? Yes, it's certainly possible, and the change is pretty much trivial. However, I'm not sure it's all that useful per se: one can almost always skip processing the terminating element on the client's side, and, given that we have almost no array operations in the expression language, it is not a big concern for any processing that might happen in the expression language inside the ksy.

  2. Can we backtrack the pointer after reading the last element, which turned out not to be what we were looking for? That should probably not be too hard to implement, i.e. we'd generate loops like these:

bool ok;
do {
    pos = this._io.pos();
    _it = new Element(this._io, this, _root);
    ok = _it.marker() != 0; // or some other condition
    if (ok) {
        this.elements.add(_it);
    } else {
        this._io.seek(pos);
    }
} while (ok);
  3. Backtracking "after getting an error" is somewhat more difficult. First of all, there is no universal concept of "error" in KS. There is the contents mechanism for magic validation, there is the read-past-end-of-stream exception, and there are several proposals (like Checksumming, other constraints and asserts validation and exceptions #81) to add validation expressions, but they are not set in stone and/or implemented yet. I believe that without making that concept solid, we can't argue about implementing backtracking-after-error.

The standard approach to what you've demonstrated is something along the lines of:

seq:
  - id: elements
    type: element
    repeat: until
    repeat-until: not _.is_valid_marker
types:
  element:
    seq:
      - id: marker
        type: u2
        enum: # ...
      - id: length
        type: u2
        if: is_valid_marker
      - id: body
        size: length - 2
        type:
          switch-on: marker
          cases:
            # ...
        if: is_valid_marker
    instances:
      is_valid_marker:
        value:
          # do some check for marker validity, like tons of comparisons or
          # something fancier

However, it consumes an extra 2 bytes of the stream as the "next marker". Your approach avoids that, but, yeah, it feels pretty hacky. I believe it should work the same way in all supported languages, but it would be very hard to use this ksy to generate writer code.

Probably it's easy to implement something like the code suggested in (2) using a syntax like:

seq:
  - id: elements
    type: element
    repeat: while
    repeat-while: _.marker != 0

I guess it should be more or less understandable and expected by the majority of users. Would it be ok with you?

@davidhicks
Author

repeat: while would be perfect for this situation. I don't think option (3) would work because the next marker would be read into elements.last. The next item in the sequence would therefore not be starting on the next marker.

@davidhicks
Author

davidhicks commented Apr 25, 2017

Just to clarify with example (2):

_it = new Element(this._io, this, _root); should not raise an exception (e.g. if there aren't enough bytes left in the stream to read, if a magic number doesn't match, etc.). Below is a rough example of the desired output code:

bool ok;
do {
    pos = this._io.pos();
    ok = false;
    try {
        _it = new Element(this._io, this, _root);
        if (_it.marker() != 0) { // or some other condition
            ok = true;
        }
    } catch (...) {} //all exceptions which need to be caught
    if (ok) {
        this.elements.add(_it);
    } else {
        this._io.seek(pos);
    }
} while (ok);

A read of Element may be unsuccessful due to the next element being of a different type, with a different magic number, a different length than Element, etc.

@GreyCat
Member

GreyCat commented Apr 26, 2017

It's a bad idea to catch all exceptions, especially in low-level-sensitive languages such as C++. I've already mentioned as (3) that a concept of an "erroneously read element" is next to non-existent in KS right now. For example, typecasting (foo.as<bar>) to an incompatible type will raise some sort of cast exception in most languages, but not in C++. Division by zero is an exception in Java, Ruby and Python, Infinity in JavaScript, and aborts with a signal in C++ (platform-dependent). Etc, etc.

So, if you want error catching and ignoring as well, we need to:

  1. Develop a more or less viable system of exceptions.
  2. Add some way to define that exception-catching behavior. It should probably be the same for parsing of all types (or at least all user types), something akin to eos-error: false.
  3. Do lots of tests and define behavior in complex scenarios, such as ones with buffered element (i.e. with size) + process, etc, etc.

For practical purposes, I guess, it's much easier (and cleaner) to just define the element type in a way that it will not break parsing with an exception, but rather will result in a mostly nulled structure.

@davidhicks
Author

davidhicks commented Apr 26, 2017

I'm happy with the suggested approach of implementing exceptions and allowing exceptions to be caught/ignored. I realise this is no easy task though!

As an interim solution, I was thinking something like the following could be done:

element_of_unknown_type:
  seq:
    - id: element
      type:
        switch-on: first_byte
        cases:
          0: element_of_type_a
          1: element_of_type_b
          2: element_of_type_c
  instances:
    first_byte:
      pos: 0
      type: u1

In order for this structure to be used successfully in cases where the size of element_of_unknown_type is unknown until it is read fully, a new mechanism would be required to create substreams of unknown size (allowing the pos method to be used). Is this a possibility? Alternatively, pos and size methods could be added to all elements in the tree, allowing instances to read data from the stream before seq attempts to do so (by use of if/switch-on/etc. with seq).
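The peek-then-dispatch idea above can be sketched in plain Python (the three parse_type_* functions are hypothetical stand-ins for the ksy cases): read the first byte, seek back so the chosen parser sees the full element, then switch on it.

```python
import io

# hypothetical element parsers; each consumes its own tag byte plus a payload
def parse_type_a(stream): return ("type_a", stream.read(2))
def parse_type_b(stream): return ("type_b", stream.read(3))
def parse_type_c(stream): return ("type_c", stream.read(1))

def parse_element(stream):
    pos = stream.tell()
    first = stream.read(1)[0]   # peek the discriminator byte
    stream.seek(pos)            # rewind so the chosen parser reads it too
    return {0: parse_type_a, 1: parse_type_b, 2: parse_type_c}[first](stream)

s = io.BytesIO(bytes([1, 0xAA, 0xBB, 0xCC]))
elem = parse_element(s)
# elem == ("type_b", b"\x01\xaa\xbb")
```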

@dgutson

dgutson commented Feb 28, 2019

Hi, is anybody working on this enhancement?

@GreyCat
Member

GreyCat commented Feb 28, 2019

Not at this moment. Probably we'll need to implement some kind of proper exception system first.

@Deservoir

Deservoir commented Apr 7, 2022

The absence of this feature is a blocker for me and anybody who deals with a variable number of blocks that:

  1. have a clear structure that starts with some marker (so I can at least try to distinguish it from something else)
  2. have different actual sizes (so I cannot predict anything without looking ahead)
  3. have no terminator (so I cannot rely on "terminator" feature)

I would rather deal with guessing whether a given instance of _ is suitable or not than have no ability to do it at all.

@akatatsu27

akatatsu27 commented Dec 18, 2022

Another case where this is needed, from a file I'm trying to parse: I have an array of my_type that is terminated by the value 0xFFFFFFFF. I am not aware of any non-stupid method that would at least prevent me from reading the next 4 bytes after the array terminator. I find it a little ridiculous that a similar feature exists for strings (consume: false), that this exact inability of the language is explicitly mentioned in section 4.10.3 of the user's guide, and that this issue has been open since 2017.

types:
  my_type:
    seq:
     - id: item1
       type: u2
     - id: item2
       type: u2
     - id: item3
       type: u2
     - id: item4
       type: u2
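For what it's worth, the desired behavior can be sketched outside of Kaitai in plain Python (little-endian fields and the sample byte layout are assumptions): peek 4 bytes before each element; if they are the 0xFFFFFFFF terminator, consume only the terminator and stop, leaving everything after it untouched.

```python
import io
import struct

TERMINATOR = b"\xff\xff\xff\xff"

def parse_my_type(stream):
    # the four u2 fields of my_type; little-endian is an assumption
    return struct.unpack("<4H", stream.read(8))

def read_items(stream):
    items = []
    while True:
        pos = stream.tell()
        if stream.read(4) == TERMINATOR:
            break                 # terminator consumed; bytes after it untouched
        stream.seek(pos)          # not the terminator: re-read as an element
        items.append(parse_my_type(stream))
    return items

# one element (1, 2, 3, 4), the terminator, then two unrelated bytes
s = io.BytesIO(bytes.fromhex("0100020003000400" "ffffffff" "9999"))
items = read_items(s)
# items == [(1, 2, 3, 4)]; the two bytes after the terminator were never read
```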

@dgutson

dgutson commented Dec 18, 2022

CC @qequ

@simi

simi commented Dec 3, 2023

I'm looking for this feature as well. I'm working on a specific format that uses a "null" entry as a separator. The following specification parses the format properly, but it includes the null entry in entries. I would like to have entries and the "null entry" separated, if possible.

seq:
  - id: entries
    type: entry
    repeat: until
    repeat-until: _.file_name == ""

types:
  entry:
    seq:
    - id: file_name
      type: strz
      encoding: ASCII
    - id: mime_type
      size: 4
    - id: original_size
      type: u4
    - id: offset
      type: u4
    - id: timestamp
      type: u4
    - id: datasize
      type: u4

@davidhicks thanks for the tip, I was able to bypass my problem by looking ahead using an instance.
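The look-ahead that would keep the null entry out of entries can be sketched in plain Python (field order taken from the ksy above; little-endian integers are an assumption): peek one byte before each entry and stop as soon as the next file_name would be empty.

```python
import io
import struct

def parse_entry(stream):
    name = bytearray()
    while (c := stream.read(1)) != b"\x00":   # strz file name
        name += c
    mime_type = stream.read(4)
    original_size, offset, timestamp, datasize = struct.unpack(
        "<4I", stream.read(16))
    return {"file_name": name.decode("ascii"), "mime_type": mime_type,
            "original_size": original_size, "offset": offset,
            "timestamp": timestamp, "datasize": datasize}

def read_entries(stream):
    entries = []
    while True:
        pos = stream.tell()
        is_null = stream.read(1) == b"\x00"   # peek: empty file_name ahead?
        stream.seek(pos)
        if is_null:
            break                             # the null entry stays unconsumed
        entries.append(parse_entry(stream))
    return entries

data = b"a.txt\x00" + b"text" + struct.pack("<4I", 10, 0, 0, 10) + b"\x00"
entries = read_entries(io.BytesIO(data))
# entries holds one real entry; the trailing null entry was not consumed
```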

@JorisVanEijden

I'm guessing this would solve my use-case too. I am currently unable to parse this:

f1 f4 41 20 73 65 6e 74 65 6e 63 65 3a f4 41 6e  ..A sentence:.An 
6f 74 68 65 72 20 73 65 6e 74 65 6e 63 65 2e f1  other sentence..
f4 50 61 72 61 67 72 61 70 68 20 32 f0 01 02 03  .Paragraph 2....

  • f1 represents the start of a paragraph and in reality has some more fields before the text-segments start.
  • f4 represents the start of a text-segment and in reality has some more fields like font color and style.
  • f0 represents the end of a page.

The desired result would be something like:

{
  "pages": [
    {
      "paragraphs": [
        {
          "textSegments": [
            {
              "text": "A sentence:"
            },
            {
              "text": "Another sentence."
            }
          ]
        },
        {
          "textSegments": [
            {
              "text": "Paragraph 2"
            }
          ]
        }
      ]
    }
  ]
}

The issue for Kaitai is that the strings have no terminator and no length.
And the number of paragraphs and textSegments within them is not specified anywhere.

In short:

  • I need to repeat paragraphs until a f0 is encountered
  • within each paragraph I need to repeat textSegments until either an f4 or an f0 is encountered
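The two bullets above can be sketched as a hand-written Python parser over the sample bytes (pure illustration of the desired look-ahead, not Kaitai output): each nested loop stops on a marker byte without consuming markers that belong to an outer level.

```python
# the sample dump from above: f1 = paragraph, f4 = text segment, f0 = end of page
data = bytes.fromhex(
    "f1f4412073656e74656e63653af4416e"
    "6f746865722073656e74656e63652ef1"
    "f45061726167726170682032f0010203"
)

def parse(data):
    pages, i = [], 0
    while i < len(data) and data[i] == 0xF1:         # a page starts at a paragraph
        paragraphs = []
        while i < len(data) and data[i] == 0xF1:     # repeat paragraphs until f0
            i += 1
            segments = []
            while i < len(data) and data[i] == 0xF4: # repeat segments until f1/f0
                i += 1
                start = i
                while i < len(data) and data[i] not in (0xF0, 0xF1, 0xF4):
                    i += 1
                segments.append(data[start:i].decode("ascii"))
            paragraphs.append({"textSegments": [{"text": t} for t in segments]})
        if i < len(data) and data[i] == 0xF0:
            i += 1                                   # consume end-of-page marker
        pages.append({"paragraphs": paragraphs})
    return {"pages": pages}

result = parse(data)
```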

@smx-smx

smx-smx commented Mar 11, 2024

In case it might help someone, here is a workaround I used for the enhanced PE parser I'm working on, based on the Kaitai one.

NOTE: it will only work if the size of the elements in the repeat block is fixed (no dynamically sized elements).
It's based on this workaround by @generalmimon

  import_descriptor_reader:
    seq:
      - id: invoke_items_start
        size: 0
        if: items_start >= 0
      - id: buffer
        type: import_descriptor
        repeat: until
        # reads up to and including the first item matching this condition;
        # the instances block below takes care of skipping that last item
        repeat-until:  _.name.value == 0 and _.first_thunk.value == 0
        size: sizeof<import_descriptor>
      - id: invoke_items_end
        size: 0
        if: items_end >= 0
    instances:
      # captures the data starting position
      items_start:
        value: _io.pos
      # captures the data ending position
      items_end:
        value: _io.pos
      # counts how many bytes we read, and excludes the last item
      items_size:
        value: items_end - sizeof<import_descriptor> - items_start
      # converts the size into number of elements
      items_count:
        value: items_size / sizeof<import_descriptor>
      # re-reads the items, this time without the last item
      items:
        pos: items_start
        repeat: expr
        repeat-expr: items_count
        type: import_descriptor
        size: sizeof<import_descriptor>

I'm using a variation of this workaround to answer the question "find the first element of an array that satisfies a condition", in order to find the PE section that contains a given virtual address.
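The same two-pass trick, reduced to plain Python (RECORD_SIZE and the all-zero sentinel record are stand-ins for sizeof&lt;import_descriptor&gt; and the null descriptor): scan forward to find the sentinel, derive the element count from the positions, then re-read everything except the last record.

```python
import io

RECORD_SIZE = 8   # stand-in for sizeof<import_descriptor>

def read_records(stream):
    start = stream.tell()
    # first pass: read until the all-zero sentinel record, inclusive
    while stream.read(RECORD_SIZE) != b"\x00" * RECORD_SIZE:
        pass
    end = stream.tell()
    count = (end - start) // RECORD_SIZE - 1   # exclude the sentinel
    # second pass: re-read exactly `count` records from the start
    stream.seek(start)
    records = [stream.read(RECORD_SIZE) for _ in range(count)]
    stream.seek(end)   # end up just past the sentinel, like the first pass
    return records

s = io.BytesIO(b"\x01" * 8 + b"\x02" * 8 + b"\x00" * 8 + b"\xff\xff")
records = read_records(s)
# records holds the two real entries; the sentinel record is excluded
```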
