Make ValidationNotEqualError throw HEX-bytes in case of given byte arrays. #7
…tai-io#5) That's what's done in Java as well and HEX-conversion was already available anyway. Brackets are added on purpose to keep the output somewhat compatible with that of Java. This fixes kaitai-io#4.
I'm sorry that I forgot to post this reply to #5 (comment) earlier (I had it half written, but I didn't get to finish and post it), but I don't agree with this implementation.
The following makes the mentioned test succeed
It's not only about making the random test (that revealed this problem) succeed. The problem is that Stream.format_hex works on any string, not just a byte string:
def format_hex(bytes)
bytes.unpack('H*')[0].gsub(/(..)/, '\1 ').chop
end
puts format_hex('😂').inspect # "f0 9f 98 82"
puts '😂'.encoding # UTF-8
We don't want to print character strings in hex. If someone wants to use valid/eq on strings, they should be able to do that, and eventually get the validation error message with human-readable strings (which does not mean that I can read the one below), not byte hex dumps:
82 B1 82 F1 82 C9 82 BF 82 B1 00
seq:
  - id: fixed_str
    type: str
    terminator: 0
    encoding: SJIS
    valid:
      eq: '"こんにちは"'
It is explicitly declared as a text string, so the message should be like
ValidationNotEqualError: not equal, expected [こんにちは], but got [こんにちこ]
not
ValidationNotEqualError: not equal, expected [82 b1 82 f1 82 c9 82 bf 82 cd 00], but got [82 b1 82 f1 82 c9 82 bf 82 b1 00]
It's also about consistency among target languages that we support. You'll also get the string version in Java for strings (String does not match byte[], but Object) and the hex version will only show for byte[].
I don't think using that approach really 1:1 with different method signatures is available in Ruby
Checking actual.is_a?(String) and expected.is_a?(String) and actual.encoding == Encoding::ASCII_8BIT and expected.encoding == Encoding::ASCII_8BIT, as I suggest, is the closest and most accurate Ruby equivalent.
In my opinion, all other guesses aren't any better in the end.
Comparing String#encoding to Encoding::ASCII_8BIT is not a guess. See https://ruby-doc.org/core-3.0.0/Encoding.html:
Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string. But as the name insists, its characters in the range of ASCII are considered as ASCII characters. This is useful when you use ASCII-8BIT characters with other ASCII compatible characters.
If you call this a "guess", your method with catching "NoMethodError" is a shotgun that doesn't even try to precisely aim at the center of a target (because the bullets are dispersed everywhere anyway) and doesn't care what other casualties are shot.
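For illustration, the encoding-based check suggested above could look roughly like the following sketch. The names `byte_string?` and `not_equal_message` are hypothetical and not part of the runtime; `format_hex` is copied from the snippet earlier in this comment, and the message layout only mimics the examples quoted in this thread.

```ruby
# Hex formatter as quoted earlier in this comment.
def format_hex(bytes)
  bytes.unpack('H*')[0].gsub(/(..)/, '\1 ').chop
end

# Hypothetical predicate: treat a value as a byte array only if it is a
# String tagged with the binary encoding (ASCII-8BIT).
def byte_string?(value)
  value.is_a?(String) && value.encoding == Encoding::ASCII_8BIT
end

# Hypothetical message builder: hex dump for byte strings,
# the regular inspect output for everything else.
def not_equal_message(expected, actual)
  if byte_string?(expected) && byte_string?(actual)
    "not equal, expected [#{format_hex(expected)}], but got [#{format_hex(actual)}]"
  else
    "not equal, expected #{expected.inspect}, but got #{actual.inspect}"
  end
end

puts not_equal_message("\x82\xB1".b, "\x82\xCD".b)
puts not_equal_message('abc', 'abd')
```

With this shape, a text string never reaches `format_hex`, regardless of whether it happens to respond to `unpack`.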
Hello Petr Pučil,
on Monday, 8 March 2021 at 13:37 you wrote:
> It's not only about making the random test (that revealed this
> problem) succeed. The problem is that `Stream.format_hex` works on
> _any_ string, not just a **byte string**:
It works on anything that declares itself printable as HEX-bytes by
providing "unpack". One could easily consider that to be by design, which would only require me to change the commit message to make that clearer. Automatic stringification of instances is likewise only there because it was good enough to start with.
> We don't want to print character strings in hex.[...]
It's not only about if you can actually understand text in foreign
characters, it's about IF those characters are even shown properly at
all. Things heavily depend on the current environment like the shell
and its character encoding in use, which is simply ignored when using
automatic stringification.
Ever had to deal with Unicode equivalence and different character
composition, e.g. with file names in some ZIP compared to the "same"
names in file systems? The characters look exactly the same in
human-readable text, but the bytes are still different and on many
platforms decide whether some file name is found or not. Outputting
HEX-bytes would make it easy to spot ANY difference on byte level and
would therefore make the actual error message more useful.
Of course there are use cases in which one prefers reading human text
and exactly that makes debugging easier... But when ONLY ever printing
human-readable, somehow automatically rendered text, one can't
choose anymore either. And it's NOT guaranteed that one gets any unique
and helpful output at all. HEX-bytes OTOH should always be unique and
correct on a wide variety of setups.
> It's also about consistency among target languages that we support.
Target languages aren't consistent at the moment anyway, so things could be
changed in either direction right now. Just look at C++: where does it
distinguish between different kinds of content, and for what reasons?
https://github.com/kaitai-io/kaitai_struct_cpp_stl_runtime/blob/4da1668426bd67aaf1932628c1c7db76ff77665c/kaitai/exceptions.h#L106
"Object" and automatic stringification in Java is only used because
it's easily available, it doesn't necessarily mean it's the best
choice. HEX-bytes for "byte[]" OTOH is most likely the best choice and
MIGHT be for all "texts" as well.
There are a lot of people arguing that "texts" should preferably be seen as
byte arrays, to avoid problems with identical-looking
characters that in fact have different meanings.
https://en.wikipedia.org/wiki/IDN_homograph_attack
https://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode
That's one of the reasons many Linux file systems simply store bytes
instead of characters, besides backwards compatibility with ASCII.
> If you call this a "guess",
It's clearly a guess and a workaround for language limitations.
Leaving aside the question of how to show "textual content" in error
messages, you can't really deny that.
> your method with catching
> "NoMethodError" is a shotgun that doesn't even try to precisely aim
> at the center of a target (because the bullets are dispersed
> everywhere anyway) and doesn't care what other casualties are shot.
The official docs readily state which types implement "unpack" in which
version as well. That is no different from your
example from the docs, but "unpack" is something one can check
for reliably. ASCII_8BIT OTOH can easily be a valid character encoding
for human-readable, textual content instead.
Keep in mind that #116 is still open and ASCII_8BIT is not guaranteed to
be an invalid value, so it might be passed through for textual content for
a long time, which might result in false positives as well.
kaitai-io/kaitai_struct#116
Best regards,
Thorsten Schöning
…--
AM-SoFT IT-Service - Bitstore Hameln GmbH i.G.
…kes sure that a textual error message is thrown instead of HEX-bytes. kaitai-io/kaitai_struct_ruby_runtime#7
I've provided a different implementation to distinguish |
TBH, I must admit that you made fair points. Printing exotic Unicode characters to the console really isn't the best idea. And thanks for pointing out the Unicode equivalence, I can see in https://stackoverflow.com/a/33897864 that it can be a really nasty thing. Also, I've just found that the
puts [0xe8, 0x8a, 0xb1].pack('C*').force_encoding('UTF-8') # => 花
puts [0xe8, 0x8a, 0xb1].pack('C*').force_encoding('UTF-8').inspect # => "花"
puts [0x89, 0xd4].pack('C*').force_encoding('SJIS') # => 花
puts [0x89, 0xd4].pack('C*').force_encoding('SJIS').inspect # => "\x{89D4}"
I've also done a simple test whether the same strings with different encodings are equal to each other in Ruby, and they aren't:
meta:
  id: test
  encoding: SJIS
seq:
  - id: magic
    type: str
    size-eos: true
    valid:
      eq: '"花"'
pp@DESKTOP-MIPASSQ MINGW64 /c/temp/ruby-shift-jis-valid-eq-str
$ ksv shift-jis-char.bin test.ksy
Compilation OK
... processing test.ksy 0
...... loading test.rb
Classes loaded OK, main class = Test
Traceback (most recent call last):
5: from C:/Ruby27-x64/bin/ksv:23:in `<main>'
4: from C:/Ruby27-x64/bin/ksv:23:in `load'
3: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/kaitai-struct-visualizer-0.7/bin/ksv:53:in `<top (required)>'
2: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/kaitai-struct-visualizer-0.7/lib/kaitai/struct/visualizer/visualizer.rb:13:in `run'
1: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/kaitai-struct-visualizer-0.7/lib/kaitai/struct/visualizer/parser.rb:32:in `load'
C:/Users/pp/AppData/Local/Temp/d20210308-12888-58e1pn/test.rb:21:in `_read': /seq/0: at pos 2: validation failed: not equal, expected "花", but got "\\x{89D4}" (Kaitai::Struct::ValidationNotEqualError)
One would need to do
- valid:
-   eq: '"花"'
+ valid:
+   eq: '[0x89, 0xd4].to_s("SJIS")' # "花"
Which is something that makes
Also the string literal
puts "\x{89D4}" # invalid hex escape
It also doesn't match anything on https://docs.ruby-lang.org/en/2.4.0/syntax/literals_rdoc.html#label-Strings.
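The equality behaviour described above can be reproduced without the compiler. This is a minimal sketch that reuses the byte values from the snippets in this comment; the variable names are made up for illustration.

```ruby
# The same character, once as SJIS bytes and once as UTF-8 bytes.
sjis = [0x89, 0xd4].pack('C*').force_encoding('SJIS')
utf8 = [0xe8, 0x8a, 0xb1].pack('C*').force_encoding('UTF-8')

# Ruby compares the underlying bytes (and encodings), so these differ.
puts sjis == utf8

# After transcoding to a common encoding the comparison succeeds.
puts sjis.encode('UTF-8') == utf8
```

So a `valid/eq` check on a non-UTF-8 string would need a transcoding step on one side before `==` can succeed, which is what the `to_s("SJIS")` workaround above achieves from the other direction.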
Fair enough, I agree. It would be best to only use ASCII characters in error messages for best compatibility and portability.
This is all very nice, and I would really be inclined to the hex byte solution. Even though you're parsing and comparing text strings, you are able to show their original byte representation as if they were created as byte arrays from the very beginning. This can indeed be very useful, because you can see the exact actual bytes that are present in the hex dump and you also see what these bytes would have to match (the expected value) to pass the validation.

However, this is something only applicable to Ruby and maybe a few other languages that use the "string is a sequence of bytes" paradigm (I can think of C++, PHP and Lua). In other languages, a "text string" is just a string for text, without any guarantee that the bytes you use for its creation will be preserved. For example, in JavaScript all strings internally use UTF-16 encoding, and there's no way to change this. Therefore, although you of course can convert these strings to bytes in whatever encoding you like, you have no means to figure out what encoding you should use to match the original byte representation in the hex dump. Unless you transport the information about encoding along with each string, just as Ruby does (in JavaScript, I can imagine that it could be done really simply by replacing all strings with a class storing the string and the encoding, which would implement the method |
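To make the "string is a sequence of bytes" point concrete, here is a small Ruby sketch. `hex_dump` is a hypothetical helper written for this example, not the runtime's `format_hex`; the byte values are taken from the SJIS example earlier in this thread.

```ruby
# Hypothetical helper: hex dump of whatever bytes the string carries.
def hex_dump(str)
  str.bytes.map { |b| format('%02x', b) }.join(' ')
end

raw  = [0x82, 0xb1, 0x82, 0xf1].pack('C*')   # packed bytes, tagged ASCII-8BIT
text = raw.dup.force_encoding('SJIS')        # same bytes, now tagged as text

puts raw.encoding     # the binary encoding
puts text.encoding    # the text encoding the string carries with it
puts hex_dump(text)   # original bytes are still recoverable from the text string
```

Because the encoding tag travels with the string and the bytes are left untouched, a hex dump of a text string always matches the bytes in the input file.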
Thanks, but that doesn't mean we need to make things more difficult than they are already right now. Showing human readable texts, numbers etc. totally makes sense pretty often and the current distinction of So I suggest merging the current implementation if it is what you had in mind earlier anyway and change as necessary in future. Our different arguments won't go away and can be used in additional issues as necessary. |
Hello Petr Pučil,
on Tuesday, 9 March 2021 at 23:38 you wrote:
> + # but especially in the context of KS things are differently
> likely. This especially means that an
Um, what do you mean by that
That should have simply meant that KS won't allow that encoding for
human readable texts most likely and is the reason I linked the
corresponding issue/discussion. I think I made that more clear now by
including your phrase as well.
Best regards,
Thorsten Schöning
Hello Petr Pučil,
on Wednesday, 10 March 2021 at 13:37 you wrote:
```ruby
if Stream.is_byte_array?(expected) and Stream.is_byte_array?(actual)
```
The current function works that way already; that's why varargs are
used. Providing both args in one call makes the call site shorter.
```suggestion
# Guess if the given args are most likely byte arrays.
```
Changed.
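The varargs point can be sketched as follows. `byte_arrays?` is a hypothetical stand-in for the helper under review, and the encoding check inside it is an assumption based on the earlier discussion, not the code in this PR.

```ruby
# Hypothetical varargs predicate: true only if every given argument
# looks like a byte array (a String tagged as ASCII-8BIT).
def byte_arrays?(*args)
  args.all? { |arg| arg.is_a?(String) && arg.encoding == Encoding::ASCII_8BIT }
end

expected = "\x00\x01".b
actual   = "\x00\x02".b

# One call covers both values, which keeps the call site short.
puts byte_arrays?(expected, actual)
puts byte_arrays?(expected, 'text')
```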
Best regards,
Thorsten Schöning
Thanks!
Following the comment kaitai-io/kaitai_struct_ruby_runtime#7 (comment) Test that the string parsed from stream in the specified encoding is equal to the appropriate literal UTF-8 counterpart. This is designed to point out a problem in Ruby, where each string is represented as a byte array holding the characters in specified encoding. Testing the same strings for equality will then fail if their byte representations are not the same (i.e. if they use encodings that represent chars differently).
As can be read in #4, the current implementation of ValidationNotEqualError doesn't output HEX-bytes like Java does, even though those would most likely be better for debugging etc. This PR brings that output in line with Java while keeping backwards compatibility for given strings or other objects that are NOT byte arrays at all.
The current implementation doesn't need any guessing about the given type: the value can either be handled as a byte array, or that fails and the former implementation is used instead. This is pretty much the same approach as is implemented in Java: two different CTORs exist, one explicitly targeting byte[] and the other one being a fallback for everything else using Object.
I don't think using that approach really 1:1 with different method signatures is available in Ruby, so catching an exception seems to be the closest one can get instead. While using exceptions for normal code flow is discouraged in most cases, keep in mind that we have an error condition here already and this way things don't depend on checking encodings or stuff like that. In my opinion, all other guesses aren't any better in the end.
Details why this is necessary can be read at the following discussion:
#5 (review)
This fixes #4 (again).
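The exception-based fallback described in this PR text can be sketched roughly as below. This is not the code from the PR: `describe` is a hypothetical name, and `format_hex` is the helper quoted in the review discussion.

```ruby
# Hex formatter as quoted in the review discussion.
def format_hex(bytes)
  bytes.unpack('H*')[0].gsub(/(..)/, '\1 ').chop
end

# Hypothetical sketch: try the hex path first and fall back to the
# regular representation if the value does not support "unpack".
def describe(value)
  "[#{format_hex(value)}]"
rescue NoMethodError
  value.inspect
end

puts describe("\x01\x02".b)  # byte string: hex path
puts describe(42)            # Integer has no unpack: fallback path
puts describe('😂')          # text string also takes the hex path
```

The last call shows the behaviour the review objects to: any String responds to `unpack`, so text strings are hex-dumped as well.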