v2: incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError) #8

yob · 2024-09-01T11:18:34Z

Thanks for maintaining this library ❤️

I noticed that #7 helped to prompt a v2 release, and over in yob/pdf-reader#538 I've had a suggestion to relax the pdf-reader dependencies to allow v2 to be used.

I gave it a go, but the CI build on ruby versions that installed Ascii85 v2 failed, for example: https://buildkite.com/yob-opensource/pdf-reader/builds/629#0191ac89-884b-4fc1-ace3-3f1a7b11258a

The input data was pulled from a test PDF and is hard to work with for a reproduction, so I trimmed the sample down and put together a short script:

# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"

  #gem 'Ascii85', '1.1.1'
  gem 'Ascii85', '2.0.0'
end

require 'ascii85'

data = %Q{<~8;Xu[gMYb*&H)\\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>}
puts data

puts "*****************************"
puts "input utf8"
puts "*****************************"

puts data.encoding
puts data.valid_encoding?

res = Ascii85.decode(data)
puts res.inspect

If I flip the Ascii85 version between v1 and v2, the input data works on v1.1.1 and raises an exception on v2.0.0:

$ ruby repro.rb 
<~8;Xu[gMYb*&H)\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>
*****************************
input utf8
*****************************
UTF-8
true
/home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:345:in `write': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:298:in `decode_raw'
        from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:192:in `decode'
        from repro.rb:24:in `<main>'

The output data is expected to be binary and not valid UTF-8. I assume I might be able to work around it by using the new v2 API to pass in a binary-encoded output buffer. However, pdf-reader still supports rubies < 2.7, so I'm aiming to use the v1-compatible parts of Ascii85's API.


DataWraith · 2024-09-02T19:04:21Z

Thank you for such an outstanding bug report! The script made it really easy to reproduce the problem.

I was too confident that my specs would protect me from this kind of issue, but it seems I have not tested binary data sufficiently.

I managed to distill your example down into a short string that triggers the issue, <~S$ojXOT~> (OU and OT are equivalent because the last bits get chopped off, but the gem produces OT when encoding the data), but alas I have now run out of time for today.
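
The equivalence of OU and OT can be checked by hand. The following is a minimal sketch of how a trailing partial group is conventionally decoded in Ascii85 (pad with "u", decode as a full group, keep one byte fewer than the number of characters). The helper name `decode_partial_group` is hypothetical and this is not the gem's implementation.

```ruby
# Decode one trailing (partial) Ascii85 group of 2..4 characters.
# Hypothetical helper for illustration only, not part of the Ascii85 gem.
def decode_partial_group(chars)
  n = chars.bytesize
  raise ArgumentError, "expected 2..4 characters" unless (2..4).cover?(n)

  # Pad to a full 5-character group with "u", the highest Ascii85 digit.
  padded = chars.ljust(5, "u")

  # Interpret the 5 characters as a base-85 number (digit = byte - 33).
  value = padded.each_byte.reduce(0) { |acc, byte| acc * 85 + (byte - 33) }

  # Pack as a 32-bit big-endian integer and keep only the first n - 1 bytes;
  # the discarded low-order bytes are where "U" and "T" differ.
  [value].pack("N").byteslice(0, n - 1)
end

ou = decode_partial_group("OU")
ot = decode_partial_group("OT")

puts ou == ot        # both groups decode to the same single byte
puts ou.bytes.inspect
puts ou.encoding     # Array#pack returns a binary (ASCII-8BIT) string
```

The decoded byte is above 0x7F, which is why the output cannot be treated as UTF-8 text.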

I think the issue can mostly be solved by spamming force_encoding(Encoding::ASCII_8BIT) throughout the code, to make sure that the gem always uses the BINARY encoding instead of the default UTF-8, but that is a rather ugly solution.
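
To illustrate the class of error being discussed, here is a small sketch using only core Ruby. The strings are made up for the example and do not come from the gem's source. It shows that joining a UTF-8 string and a binary string raises `Encoding::CompatibilityError` when both contain non-ASCII bytes, and that `force_encoding(Encoding::ASCII_8BIT)` on a copy avoids it.

```ruby
# Illustrative strings, not taken from the gem's source.
utf8_text = "caf\u00e9"              # UTF-8, contains a non-ASCII character
binary    = "\x9B\xB6\xB9+\x91".b    # ASCII-8BIT, contains bytes above 0x7F

# Both strings hold non-ASCII bytes in different encodings, so Ruby
# refuses to join them.
error = nil
begin
  utf8_text + binary
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.class

# Re-tagging a copy as binary changes no bytes, only the encoding label,
# after which the two strings are compatible.
joined = utf8_text.dup.force_encoding(Encoding::ASCII_8BIT) + binary
puts joined.encoding
puts joined.bytesize
```

`force_encoding` never transcodes; it only relabels the existing bytes, which is what a codec working on raw binary data wants.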

Still, I will shortly push a commit doing just that, and it at least makes the example pass -- but I'm not 100% sure if I managed to catch every instance of the problem.

I probably won't be able to work on this again before Wednesday; I'll try to see if I can uncover more edge cases that can lead to problems then.

yob · 2024-09-03T10:56:16Z

Sounds good. There's no urgency from my perspective, released versions of pdf-reader are locked to v1.x so they're continuing to work fine.

I can see a fix has been pushed to main, so I gave it a go (https://github.com/yob/pdf-reader/compare/ascii85-2-0?expand=1). The pdf-reader spec suite is green (some jobs failed, but for unrelated reasons). Here's a passing example on Ruby 3.3: https://buildkite.com/yob-opensource/pdf-reader/builds/630#0191b783-766d-4fa6-b7e4-b9583a832f1e


DataWraith · 2024-09-11T10:20:26Z

Thank you for testing the changes!

I went through the code again on the weekend and made sure that all String literals are unfrozen and encoded as ASCII_8BIT before use; that should take care of the encoding errors.
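
As a rough sketch of what "unfrozen and encoded as ASCII_8BIT" means in practice (again using made-up values and core Ruby only, not the gem's actual code): a frozen literal cannot be re-tagged in place, so a copy is re-tagged instead, and `String.new` without arguments already yields an unfrozen binary buffer.

```ruby
# Stands in for a string literal under `# frozen_string_literal: true`.
literal = "<~".freeze

# A frozen string cannot have its encoding changed in place.
frozen_error = nil
begin
  literal.force_encoding(Encoding::ASCII_8BIT)
rescue FrozenError => e   # FrozenError exists on Ruby 2.5 and later
  frozen_error = e
end
puts frozen_error.class

# Duplicating first gives an unfrozen copy that can be re-tagged as binary.
prefix = literal.dup.force_encoding(Encoding::ASCII_8BIT)

# String.new with no arguments returns an unfrozen ASCII-8BIT string,
# which makes it a convenient output buffer for raw bytes.
buffer = String.new
buffer << prefix << "\x9B\xB6".b

puts buffer.encoding
puts buffer.bytes.inspect
```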

The new version has also managed to correctly encode and then decode a few gigabytes of random binary data without raising an Exception, so I hope that it works properly now.

Unless something else crops up, I'll probably release version 2.0.1 this weekend.


yob · 2024-11-02T00:37:38Z

Thanks for your help here! I've released pdf-reader with a relaxed Ascii85 dependency and all our tests are green ❤️

yob mentioned this issue Sep 1, 2024

allow Ascii85 2.0? yob/pdf-reader#538

Closed

DataWraith added a commit that referenced this issue Sep 2, 2024

specs: Add failing spec for github issue #8

b7480db

yob added a commit to yob/pdf-reader that referenced this issue Nov 2, 2024

Allow Ascii85 1.0 and 2.0

855d067

But not 2.0.0, it has some encoding issues with binary data DataWraith/ascii85gem#8

yob mentioned this issue Nov 2, 2024

Allow Ascii85 1.0 and 2.0 yob/pdf-reader#539

Merged

yob closed this as completed Nov 2, 2024
