Allow repeat: until to optionally exclude the last non-matching element (look-ahead feature) #156
An example of where this feature could be helpful is in parsing the JPEG Interchange Format (see ITU-T T.81 at https://www.w3.org/Graphics/JPEG/itu-t81.pdf). In the non-hierarchical variant of this format, any number of

I have managed to implement a workaround, demonstrated below (non-essential parts removed):

types:
  frame:
    seq:
      - id: table_or_misc_segments
        type: table_or_misc_segment
        repeat: until
        repeat-until: >-
          _.next_marker != marker_codes::define_quantization_tables and
          _.next_marker != marker_codes::huffman_table_specification and
          _.next_marker != marker_codes::arithmetic_coding_conditioning_specification and
          _.next_marker != marker_codes::define_restart_interval and
          _.next_marker != marker_codes::comment and
          _.next_marker != marker_codes::application_segment_0 and
          _.next_marker != marker_codes::application_segment_1 and
          _.next_marker != marker_codes::application_segment_2 and
          _.next_marker != marker_codes::application_segment_3 and
          _.next_marker != marker_codes::application_segment_4 and
          _.next_marker != marker_codes::application_segment_5 and
          _.next_marker != marker_codes::application_segment_6 and
          _.next_marker != marker_codes::application_segment_7 and
          _.next_marker != marker_codes::application_segment_8 and
          _.next_marker != marker_codes::application_segment_9 and
          _.next_marker != marker_codes::application_segment_10 and
          _.next_marker != marker_codes::application_segment_11 and
          _.next_marker != marker_codes::application_segment_12 and
          _.next_marker != marker_codes::application_segment_13 and
          _.next_marker != marker_codes::application_segment_14 and
          _.next_marker != marker_codes::application_segment_15
    types:
      table_or_misc_segment:
        seq:
          - id: marker
            type: u2
            enum: marker_codes
          - id: length
            type: u2
          - id: data
            size: length - 2
            type:
              switch-on: marker
              cases:
                #...
        instances:
          next_marker:
            pos: _io.pos
            type: u2
            enum: marker_codes

It seems that the

Is this workaround suitable, or just a dodgy hack that will break in various translators?
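For readers who want to see the look-ahead idea outside of ksy, here is a minimal Python sketch of it, not Kaitai-generated code. It peeks at the next two-byte marker and only reads a segment when that marker belongs to the table/misc group. The marker values are the standard ones from ITU-T T.81; the helper names (`peek_u2be`, `read_table_or_misc_segments`) and the sample bytes are made up for illustration. Unlike `repeat: until`, this loop may read zero segments.

```python
import io
import struct

# JPEG table/misc marker codes (ITU-T T.81): DQT, DHT, DAC, DRI, COM, APP0..APP15.
TABLE_OR_MISC = {0xFFDB, 0xFFC4, 0xFFCC, 0xFFDD, 0xFFFE} | set(range(0xFFE0, 0xFFF0))


def peek_u2be(stream):
    """Return the next big-endian u2 without consuming it, or None at EOF."""
    pos = stream.tell()
    raw = stream.read(2)
    stream.seek(pos)
    return struct.unpack(">H", raw)[0] if len(raw) == 2 else None


def read_table_or_misc_segments(stream):
    """Read segments for as long as the *next* marker is a table/misc marker."""
    segments = []
    while peek_u2be(stream) in TABLE_OR_MISC:
        marker, length = struct.unpack(">HH", stream.read(4))
        # The length field counts itself (2 bytes) but not the marker.
        segments.append((marker, stream.read(length - 2)))
    return segments


# Hypothetical input: a COM segment, a DRI segment, then an SOF0 marker.
data = (
    b"\xff\xfe\x00\x04hi"          # COM, length 4, 2 data bytes
    b"\xff\xdd\x00\x04\x00\x08"    # DRI, length 4, 2 data bytes
    b"\xff\xc0"                    # SOF0: not a table/misc segment
)
stream = io.BytesIO(data)
segments = read_table_or_misc_segments(stream)
```

After the call the stream is positioned on the SOF0 marker, so whatever parses the frame header next sees it intact.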
There are several independent issues here.

bool ok;
do {
    pos = this._io.pos();
    _it = new Element(this._io, this, _root);
    ok = _it.marker() != 0; // or some other condition
    if (ok) {
        this.elements.add(_it);
    } else {
        this._io.seek(pos);
    }
} while (ok);
The standard approach to what you've demonstrated is something along the lines of:

seq:
  - id: elements
    type: element
    repeat: until
    repeat-until: not _.is_valid_marker
types:
  element:
    seq:
      - id: marker
        type: u2
        enum: # ...
      - id: length
        type: u2
        if: is_valid_marker
      - id: body
        size: length - 2
        type:
          switch-on: marker
          cases:
            # ...
        if: is_valid_marker
    instances:
      is_valid_marker:
        value:
          # do some check for marker validity, like tons of comparisons or
          # something fancier

However, it consumes an extra 2 bytes of the stream as "next marker". Your approach avoids that, but, yeah, it feels pretty hacky. I believe it should work the same way in all supported languages, but it would be very hard to use this ksy to generate writer code.

Probably it's easy to implement something like the code suggested in (2) using a syntax like:

seq:
  - id: elements
    type: element
    repeat: while
    repeat-while: _.marker != 0

I guess it should be more or less understandable and expected by the majority of users. Would it be ok with you?
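To make the proposed `repeat-while` semantics concrete, here is a small Python sketch of the read, test, rewind loop. It is only an illustration of the idea, not what the compiler emits; `read_element`, `repeat_while` and the two-byte element layout are assumptions made up for the example.

```python
import io


def read_element(stream):
    """Hypothetical element: a one-byte marker followed by a one-byte value."""
    raw = stream.read(2)
    if len(raw) < 2:
        raise EOFError("truncated element")
    return {"marker": raw[0], "value": raw[1]}


def repeat_while(stream, read, keep):
    """Collect elements while keep(element) holds; un-read the first that fails."""
    elements = []
    while True:
        pos = stream.tell()      # remember where this element starts
        element = read(stream)
        if not keep(element):
            stream.seek(pos)     # rewind so the next field can re-read it
            return elements
        elements.append(element)


stream = io.BytesIO(bytes([1, 10, 2, 20, 0, 99, 7, 7]))
elements = repeat_while(stream, read_element, lambda e: e["marker"] != 0)
```

The element with marker 0 is parsed once to evaluate the condition, then the stream is put back at its start, so it is neither stored in `elements` nor lost.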
Just to clarify with example (2),

bool ok;
do {
    pos = this._io.pos();
    ok = false;
    try {
        _it = new Element(this._io, this, _root);
        if (_it.marker() != 0) { // or some other condition
            ok = true;
        }
    } catch (...) {} // all exceptions which need to be caught
    if (ok) {
        this.elements.add(_it);
    } else {
        this._io.seek(pos);
    }
} while (ok);

A read of
It's a bad idea to catch all exceptions, especially in low-level-sensitive languages such as C++. I've already mentioned as (3) that a concept of "erroneously read element" is next to non-existent now in KS. For example, typecasting ( So, if you want error catching and ignoring as well, we need to:
For practical purposes, I guess, it's much easier (and cleaner) to just define
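One way to reconcile "ignore look-ahead errors" with "don't catch everything" is to swallow only a single, anticipated error type. The Python sketch below assumes a hypothetical `ShortRead` exception raised when the stream ends inside an element; the record layout and all names are invented for illustration and do not correspond to an existing Kaitai runtime API.

```python
import io
import struct


class ShortRead(Exception):
    """Hypothetical error raised when the stream ends inside an element."""


def read_exact(stream, n):
    raw = stream.read(n)
    if len(raw) != n:
        raise ShortRead(f"wanted {n} bytes, got {len(raw)}")
    return raw


def read_record(stream):
    """Hypothetical record: u2le marker followed by u2le value."""
    return struct.unpack("<HH", read_exact(stream, 4))


def read_records(stream):
    records = []
    while True:
        pos = stream.tell()
        try:
            record = read_record(stream)
        except ShortRead:
            # Only the anticipated failure is swallowed; anything else propagates.
            stream.seek(pos)
            break
        if record[0] == 0:       # same stop condition as `_it.marker() != 0`
            stream.seek(pos)
            break
        records.append(record)
    return records


# One complete record followed by three stray bytes.
stream = io.BytesIO(struct.pack("<HH", 5, 6) + b"\x01\x02\x03")
records = read_records(stream)
```

Any other exception (a bad enum value, a failed cast, a bug in the parser) still surfaces to the caller instead of silently ending the loop.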
I'm happy with the suggested approach of implementing exceptions and allowing exceptions to be caught/ignored. I realise this is no easy task though! As an interim solution, I was thinking something like the following could be done:

element_of_unknown_type:
  seq:
    - id: element
      type:
        switch-on: first_byte
        cases:
          0: element_of_type_a
          1: element_of_type_b
          2: element_of_type_c
  instances:
    first_byte:
      pos: 0
      type: u1

In order for this structure to be used successfully in cases where the size of
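The same "peek at the first byte, then dispatch" idea can be sketched in Python as follows. This is an illustration only: the two element layouts, the parser names and the `PARSERS` table are hypothetical, and the peek is taken at the current stream position rather than at an absolute `pos: 0`.

```python
import io
import struct


def parse_type_a(stream):
    """Hypothetical type A: u1 tag followed by a u2le value."""
    _tag, value = struct.unpack("<BH", stream.read(3))
    return ("a", value)


def parse_type_b(stream):
    """Hypothetical type B: u1 tag followed by a u1 value."""
    _tag, value = struct.unpack("<BB", stream.read(2))
    return ("b", value)


PARSERS = {0: parse_type_a, 1: parse_type_b}


def parse_element(stream):
    start = stream.tell()
    first_byte = stream.read(1)[0]   # look at the discriminator...
    stream.seek(start)               # ...then hand the whole element to its parser
    return PARSERS[first_byte](stream)


stream = io.BytesIO(bytes([0, 0x34, 0x12, 1, 7]))
first = parse_element(stream)
second = parse_element(stream)
```

Because the stream is rewound before dispatch, each element type still owns its tag byte and needs no knowledge of the wrapper.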
Hi, is anybody working on this enhancement?

Not at this moment. Probably we'll need to implement some kind of proper exception system first.
The absence of this feature is a blocker for me and for anybody who deals with a variable number of blocks that:

I would rather deal with guessing whether this instance of _ is suitable or not than have no ability to do it at all (
Another case where this is needed, from a file I'm trying to parse: I have an array of

types:
  my_type:
    seq:
      - id: item1
        type: u2
      - id: item2
        type: u2
      - id: item3
        type: u2
      - id: item4
        type: u2
CC @qequ
I'm looking for this feature as well. I'm working on a specific format that uses a "null" entry as a separator. The following specification parses the format properly, but it includes the null entry in entries. I would like to have the entries and the "null entry" separated if possible.

seq:
  - id: entries
    type: entry
    repeat: until
    repeat-until: _.file_name == ""
types:
  entry:
    seq:
      - id: file_name
        type: strz
        encoding: ASCII
      - id: mime_type
        size: 4
      - id: original_size
        type: u4
      - id: offset
        type: u4
      - id: timestamp
        type: u4
      - id: datasize
        type: u4

@davidhicks thanks for the tip, I was able to bypass my problem with looking forward using
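As an illustration of the desired outcome (real entries in one list, the null entry kept apart), here is a Python sketch that mirrors the `entry` layout above. Little-endian integers are an assumption, since the spec fragment does not show `endian`; the helper names and sample data are invented.

```python
import io
import struct


def read_strz(stream):
    """Read a NUL-terminated ASCII string, consuming the terminator."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("unterminated string")
        if b == b"\x00":
            return out.decode("ascii")
        out += b


def read_entry(stream):
    file_name = read_strz(stream)
    mime_type = stream.read(4)
    # Assumption: the four u4 fields are little-endian.
    original_size, offset, timestamp, datasize = struct.unpack("<4I", stream.read(16))
    return {"file_name": file_name, "mime_type": mime_type,
            "original_size": original_size, "offset": offset,
            "timestamp": timestamp, "datasize": datasize}


def read_entries(stream):
    """Return (entries, terminator): the null entry is consumed but kept apart."""
    entries = []
    while True:
        entry = read_entry(stream)
        if entry["file_name"] == "":
            return entries, entry
        entries.append(entry)


def pack_entry(name, mime, *numbers):
    return name.encode("ascii") + b"\x00" + mime + struct.pack("<4I", *numbers)


data = (pack_entry("a.txt", b"text", 10, 100, 1, 10)
        + pack_entry("b.bin", b"data", 20, 110, 2, 20)
        + pack_entry("", b"\x00" * 4, 0, 0, 0, 0))
stream = io.BytesIO(data)
entries, terminator = read_entries(stream)
```

Here the terminator is still consumed, which suits a separator; a variant that leaves it unread would save the position before `read_entry` and seek back to it.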
I'm guessing this would solve my use-case too. I am currently unable to parse this:

The desired result would be something like:

{
  "pages": [
    {
      "paragraphs": [
        {
          "textSegments": [
            {
              "text": "A sentence:"
            },
            {
              "text": "Another sentence."
            }
          ]
        },
        {
          "textSegments": [
            {
              "text": "Paragraph 2"
            }
          ]
        }
      ]
    }
  ]
}

The issue for Kaitai is that the strings have no terminator and no length. In short:
In case it might help someone, here is a workaround I used for an enhanced PE parser I'm working on, based on the Kaitai one. NOTE: it will work only if the size of the elements in the repeat block is fixed (no dynamically sized elements).

import_descriptor_reader:
  seq:
    - id: invoke_items_start
      size: 0
      if: items_start >= 0
    - id: buffer
      type: import_descriptor
      repeat: until
      # reads up to and including the element that satisfies this condition;
      # the instances block below takes care of skipping that last item
      repeat-until: _.name.value == 0 and _.first_thunk.value == 0
      size: sizeof<import_descriptor>
    - id: invoke_items_end
      size: 0
      if: items_end >= 0
  instances:
    # captures the data starting position
    items_start:
      value: _io.pos
    # captures the data ending position
    items_end:
      value: _io.pos
    # counts how many bytes we read, and excludes the last item
    items_size:
      value: items_end - sizeof<import_descriptor> - items_start
    # converts the size into number of elements
    items_count:
      value: items_size / sizeof<import_descriptor>
    # re-reads the items, this time without the last item
    items:
      pos: items_start
      repeat: expr
      repeat-expr: items_count
      type: import_descriptor
      size: sizeof<import_descriptor>

I'm using a similar variation of this workaround to answer the question "find the first element of an array that satisfies a condition", in order to find the PE section that contains the given virtual address.
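The arithmetic behind this workaround (scan to the terminator, derive the element count from the byte span, then re-read without the last item) can be sketched in Python. The record here is a deliberately simplified two-field stand-in, not the real PE import descriptor layout, and `read_until_null` is a made-up name.

```python
import io
import struct

# Simplified stand-in record: two u4le fields (name, first_thunk).
RECORD = struct.Struct("<II")


def read_until_null(stream):
    items_start = stream.tell()
    # First pass: scan forward until the all-zero terminator has been consumed.
    while True:
        name, first_thunk = RECORD.unpack(stream.read(RECORD.size))
        if name == 0 and first_thunk == 0:
            break
    items_end = stream.tell()

    # Bytes read, minus the terminator, converted to a number of elements.
    items_size = items_end - RECORD.size - items_start
    items_count = items_size // RECORD.size

    # Second pass: re-read exactly items_count records from the start.
    stream.seek(items_start)
    items = [RECORD.unpack(stream.read(RECORD.size)) for _ in range(items_count)]
    stream.seek(items_end)   # leave the stream just past the terminator
    return items


stream = io.BytesIO(struct.pack("<6I", 1, 2, 3, 4, 0, 0))
items = read_until_null(stream)
```

The division only gives a meaningful count because every record has the same size, which is the fixed-size restriction noted in the comment above.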
Currently `repeat: until` will always read and include the non-matching element. It would be preferable if this behaviour could be toggled (with `consume: false`?) so that it is possible to perform a look-ahead on future bytes in the stream. Because an attempt would need to be made to read the element following the last valid element matching the `repeat-until` condition, it is possible that errors may be encountered. In such a case the behaviour should be to catch and ignore look-ahead reading errors, and reset the stream position back to the last known element which was successfully read and passed the `repeat-until` condition.