
Make ValidationNotEqualError throw HEX-bytes in case of given byte arrays. #7

Merged

Conversation

@ams-tschoening (Contributor) commented Mar 8, 2021

As can be read in #4, the current implementation of ValidationNotEqualError doesn't output HEX-bytes like the Java runtime does, even though those are most likely better for debugging and the like. This PR brings the output in line with Java, while keeping backwards compatibility for given strings or other objects that are not byte arrays at all.

The current implementation doesn't need any guessing about the given type: either the value can be handled as a byte array, or that fails and the former implementation is used instead. This is pretty much the same approach as implemented in Java: two different constructors exist, one explicitly targeting byte[] and the other one being a fallback for everything else using Object.

    public static class ValidationNotEqualError extends ValidationFailedError {
        public ValidationNotEqualError(byte[] expected, byte[] actual, KaitaiStream io, String srcPath) {
            super("not equal, expected " + byteArrayToHex(expected) + ", but got " + byteArrayToHex(actual), io, srcPath);
        }

        public ValidationNotEqualError(Object expected, Object actual, KaitaiStream io, String srcPath) {
            super("not equal, expected " + expected + ", but got " + actual, io, srcPath);
        }

        protected Object expected;
        protected Object actual;
    }

I don't think that approach can be reproduced 1:1 in Ruby, as there are no overloaded method signatures, so catching an exception seems to be the closest one can get instead. While using exceptions for normal code flow is discouraged in most cases, keep in mind that we are already in an error condition here, and this way things don't depend on checking encodings or the like. In my opinion, all other guesses aren't any better in the end.
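For illustration, here is a minimal, self-contained sketch of that exception-based dispatch (the helper names are hypothetical; this is not the code merged in this PR):

def format_hex(bytes)
  bytes.unpack('H*')[0].gsub(/(..)/, '\1 ').chop
end

# Try to hex-dump the value as a byte string; anything without String#unpack
# (e.g. an Integer) raises NoMethodError and falls back to plain stringification.
def describe_value(value)
  "[#{format_hex(value)}]"
rescue NoMethodError
  value.to_s
end

puts describe_value("\x50\x4b\x03\x04".b)  # => [50 4b 03 04]
puts describe_value(42)                    # => 42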

Details on why this is necessary can be found in the following discussion:

#5 (review)

This fixes #4 (again).

ams-tschoening and others added 2 commits March 3, 2021 09:38
…tai-io#5)

That's what's done in Java as well, and the HEX-conversion was already available anyway. Brackets are added on purpose to keep the output somewhat compatible with that of Java.

This fixes kaitai-io#4.
@generalmimon (Member) left a comment


I'm sorry that I forgot to post this reply to #5 (comment) earlier (I had it half written, but I didn't get to finish and post it), but I don't agree with this implementation.

The following makes the mentioned test succeed

It's not only about making the random test (that revealed this problem) succeed. The problem is that Stream.format_hex works on any string, not just a byte string:

def format_hex(bytes)
  bytes.unpack('H*')[0].gsub(/(..)/, '\1 ').chop
end
puts format_hex('😂').inspect # "f0 9f 98 82"
puts '😂'.encoding # UTF-8

We don't want to print character strings in hex. If someone wants to use valid/eq on strings, they should be able to do that and eventually get a validation error message with human-readable strings (not that I can actually read the one below), not byte hex dumps:

Input data (hex): 82 B1 82 F1 82 C9 82 BF 82 B1 00

seq:
  - id: fixed_str
    type: str
    terminator: 0
    encoding: SJIS
    valid:
      eq: '"こんにちは"'

It is explicitly declared as a text string, so the message should be like

ValidationNotEqualError: not equal, expected [こんにちは], but got [こんにちこ]

not

ValidationNotEqualError: not equal, expected [82 b1 82 f1 82 c9 82 bf 82 cd 00], but got [82 b1 82 f1 82 c9 82 bf 82 b1 00]

It's also about consistency among the target languages that we support. You'll also get the string version in Java for strings (a String does not match byte[], but Object), and the hex version will only show up for byte[].


I don't think using that approach really 1:1 with different method signatures is available in Ruby

Checking actual.is_a?(String) and expected.is_a?(String) and actual.encoding == Encoding::ASCII_8BIT and expected.encoding == Encoding::ASCII_8BIT as I suggest is the closest and most accurate Ruby equivalent.
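As a rough sketch, the same hypothetical describe_value helper as above, now using the encoding check instead of rescuing an exception (again, not necessarily the code that was merged):

def byte_string?(value)
  value.is_a?(String) && value.encoding == Encoding::ASCII_8BIT
end

def describe_value(value)
  if byte_string?(value)
    "[#{value.unpack('H*')[0].gsub(/(..)/, '\1 ').chop}]"  # hex dump for byte strings
  else
    value.to_s                                             # text strings and everything else
  end
end

puts describe_value("\x82\xB1\x82\xF1".b)  # byte string  => [82 b1 82 f1]
puts describe_value("hello")               # text string  => hello
puts describe_value(42)                    # other object => 42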

In my opinion, all other guesses aren't any better in the end.

Comparing String#encoding to Encoding::ASCII_8BIT is not a guess. See https://ruby-doc.org/core-3.0.0/Encoding.html:

Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string. But as the name insists, its characters in the range of ASCII are considered as ASCII characters. This is useful when you use ASCII-8BIT characters with other ASCII compatible characters.
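As a small illustration (not part of the runtime, and assuming a UTF-8 source file): data read in binary mode comes back tagged as ASCII-8BIT, while ordinary string literals carry a character encoding:

puts "abc".encoding                      # => UTF-8 (a character string)
puts "abc".b.encoding                    # => ASCII-8BIT (a byte string)
puts File.binread(__FILE__, 4).encoding  # => ASCII-8BIT (raw bytes from a stream)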

If you call this a "guess", then your method of catching NoMethodError is a shotgun that doesn't even try to aim precisely at the center of the target (because the pellets are dispersed everywhere anyway) and doesn't care what else gets hit.

@ams-tschoening (Contributor, Author) commented Mar 8, 2021 via email

ams-tschoening added a commit to ams-ts-kaitai/tests that referenced this pull request Mar 8, 2021
…kes sure that a textual error message is thrown instead of HEX-bytes.

kaitai-io/kaitai_struct_ruby_runtime#7
@ams-tschoening (Contributor, Author)

I've provided a different implementation that distinguishes byte[] from human-readable text, and a corresponding test that makes sure the correct error message is given by checking for human-readable text. This also makes it easier to track a possibly different decision to output HEX instead of chars in the future, because that concrete test with its different error messages would fail then.

kaitai-io/kaitai_struct_tests#93

@generalmimon (Member)

TBH, I must admit that you made fair points. Printing exotic Unicode characters to the console really isn't the best idea. And thanks for pointing out the Unicode equivalence issue; I can see in https://stackoverflow.com/a/33897864 that it can be a really nasty thing.

Also, I've just found that the String#inspect method in Ruby works quite clumsily - the same character is either "escaped" or not escaped depending on its encoding:

puts [0xe8, 0x8a, 0xb1].pack('C*').force_encoding('UTF-8') # => 花
puts [0xe8, 0x8a, 0xb1].pack('C*').force_encoding('UTF-8').inspect # => "花"

puts [0x89, 0xd4].pack('C*').force_encoding('SJIS') # => 花
puts [0x89, 0xd4].pack('C*').force_encoding('SJIS').inspect # => "\x{89D4}"

I've also done a simple test of whether the same strings in different encodings are equal to each other in Ruby, and they aren't:

test.ksy

meta:
  id: test
  encoding: SJIS
seq:
  - id: magic
    type: str
    size-eos: true
    valid:
      eq: '"花"'

shift-jis-char.bin

89 D4

pp@DESKTOP-MIPASSQ MINGW64 /c/temp/ruby-shift-jis-valid-eq-str
$ ksv shift-jis-char.bin test.ksy
Compilation OK
... processing test.ksy 0
...... loading test.rb
Classes loaded OK, main class = Test
Traceback (most recent call last):
        5: from C:/Ruby27-x64/bin/ksv:23:in `<main>'
        4: from C:/Ruby27-x64/bin/ksv:23:in `load'
        3: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/kaitai-struct-visualizer-0.7/bin/ksv:53:in `<top (required)>'
        2: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/kaitai-struct-visualizer-0.7/lib/kaitai/struct/visualizer/visualizer.rb:13:in `run'
        1: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/kaitai-struct-visualizer-0.7/lib/kaitai/struct/visualizer/parser.rb:32:in `load'
C:/Users/pp/AppData/Local/Temp/d20210308-12888-58e1pn/test.rb:21:in `_read': /seq/0: at pos 2: validation failed: not equal, expected "花", but got "\\x{89D4}" (Kaitai::Struct::ValidationNotEqualError)
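The same comparison, distilled to plain Ruby (an illustration only), shows why the valid/eq check fails:

parsed   = [0x89, 0xd4].pack('C*').force_encoding('SJIS')  # what the parser produced: 花 in Shift-JIS
expected = '花'                                            # the UTF-8 literal from the .ksy
puts parsed == expected  # => false (bytes 89 D4 vs. E8 8A B1)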

One would need to do

-    valid:
-      eq: '"花"'
+    valid:
+      eq: '[0x89, 0xd4].to_s("SJIS")' # "花"

This makes valid/eq on strings quite useless at the moment, and we'll need to fix it as well (not that it's closely related to this issue, I just wanted to write it down somewhere).

Also, the string literal "\x{89D4}" is totally unusable; I don't understand what the Ruby developers wanted to convey with it. The Ruby interpreter does not recognize this "literal" at all:

puts "\x{89D4}" # invalid hex escape

It also doesn't match anything on https://docs.ruby-lang.org/en/2.4.0/syntax/literals_rdoc.html#label-Strings. 89 D4 is the byte representation of the character 花 in Shift-JIS, but there doesn't seem to be a way to use these bytes directly in a string literal without needing to do something like .encode or .force_encoding afterwards.
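The closest workaround (illustrative only) is to spell out the raw bytes with escapes and re-tag the encoding afterwards:

s = "\x89\xD4".force_encoding('SJIS')
puts s           # => 花 (on a terminal that understands Shift-JIS)
puts s.bytesize  # => 2
puts s == [0x89, 0xd4].pack('C*').force_encoding('SJIS')  # => true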


it's about IF those characters are even shown properly at all. Things heavily depend on the current environment like the shell and its character encoding in use, which is simply ignored when using automatic stringification.

Fair enough, I agree. It would be best to only use ASCII characters in error messages for best compatibility and portability.

Characters would look exactly the same in human-readable text, but the bytes would still be different, and on many platforms they decide whether some file name is found or not. Outputting HEX-bytes would allow one to easily spot ANY difference on the byte level and therefore would make the actual error message more useful.

Of course there are use cases in which one prefers reading human text, and exactly that makes debugging easier... But when ONLY ever printing human-readable, somehow automatically rendered text, one can't choose anymore either, and it's NOT guaranteed to get any unique and helpful output at all. HEX-bytes, OTOH, should always be unique and correct on a wide variety of setups.

This is all very nice, and I would really be inclined towards the hex byte solution. Even though you're parsing and comparing text strings, you are able to show their original byte representation as if they had been created as byte arrays from the very beginning. This can indeed be very useful, because you can see the exact actual bytes that are present in the hex dump, and you also see what these bytes would have to match (the expected value) to pass the validation.

However, this is something only applicable to Ruby and maybe a few other languages that use the "string is a sequence of bytes" paradigm (I can think of C++, PHP and Lua). In other languages, a "text string" is just a string for text, without any guarantee that the bytes you used for its creation will be preserved. For example, in JavaScript all strings internally use UTF-16 encoding, and there's no way to change this. Therefore, although you can of course convert these strings to bytes in whatever encoding you like, you have no means to figure out which encoding you should use to match the original byte representation in the hex dump. Unless you transport the information about the encoding along with each string, just as Ruby does (in JavaScript, I can imagine that it could be done really simply by replacing all strings with a class storing the string and the encoding, which would implement the method valueOf that is automatically called by JavaScript whenever a string is needed; I doubt that something like that can be done in other languages, though).

@ams-tschoening (Contributor, Author)

TBH, I must admit that you made fair points.[...]

Thanks, but that doesn't mean we need to make things more difficult than they already are. Showing human-readable text, numbers etc. makes sense pretty often, and the current distinction of byte[] vs. string vs. anything else makes that available. It's even backwards compatible, which in itself might already be a good thing.

So I suggest merging the current implementation, if it is what you had in mind earlier anyway, and changing it as necessary in the future. Our different arguments won't go away and can be reused in additional issues as necessary.

@ams-tschoening (Contributor, Author) commented Mar 10, 2021 via email

lib/kaitai/struct/struct.rb (review thread, resolved)
lib/kaitai/struct/struct.rb (review thread, outdated, resolved)
@ams-tschoening (Contributor, Author) commented Mar 10, 2021 via email

@generalmimon (Member) left a comment


Thanks!

@generalmimon merged commit ea64220 into kaitai-io:master on Apr 20, 2021
@ams-tschoening deleted the ghi_4_hex_bytes_vs_letters branch on April 21, 2021 08:19
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this pull request Jul 1, 2021
Following the comment kaitai-io/kaitai_struct_ruby_runtime#7 (comment)

Test that the string parsed from the stream in the specified encoding
is equal to the appropriate literal UTF-8 counterpart.

This is designed to point out a problem in Ruby, where each string
is represented as a byte array holding the characters in the specified
encoding. Testing otherwise equal strings for equality will then fail
if their byte representations are not the same (i.e. if they use
encodings that represent chars differently).
Successfully merging this pull request may close these issues.

"Kaitai::Struct::ValidationNotEqualError" throws letters instead of HEX-bytes like in Java.