v2: incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError) #8

Closed
yob opened this issue Sep 1, 2024 · 4 comments

Comments

@yob

yob commented Sep 1, 2024

Thanks for maintaining this library ❤️

I noticed that #7 helped to prompt a v2 release, and over in yob/pdf-reader#538 I've had a suggestion to relax the pdf-reader dependencies to allow v2 to be used.

I gave it a go, but the CI build on ruby versions that installed Ascii85 v2 failed, for example: https://buildkite.com/yob-opensource/pdf-reader/builds/629#0191ac89-884b-4fc1-ace3-3f1a7b11258a

The input data was pulled from a test PDF and is hard to work with for a reproduction, so I trimmed the sample down and put together a short script:

# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"

  #gem 'Ascii85', '1.1.1'
  gem 'Ascii85', '2.0.0'
end

require 'ascii85'

data = %Q{<~8;Xu[gMYb*&H)\\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>}
puts data

puts "*****************************"
puts "input utf8"
puts "*****************************"

puts data.encoding
puts data.valid_encoding?

res = Ascii85.decode(data)
puts res.inspect

If I flip the Ascii85 version between v1 and v2, the input data works on v1.1.1 and raises an exception on v2.0.0:

$ ruby repro.rb 
<~8;Xu[gMYb*&H)\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>
*****************************
input utf8
*****************************
UTF-8
true
/home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:345:in `write': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:298:in `decode_raw'
        from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:192:in `decode'
        from repro.rb:24:in `<main>'

The output data is expected to be binary and not valid UTF-8. I assume I might be able to work around it by using the new v2 API to pass in a binary-encoded output buffer; however, pdf-reader still supports rubies < 2.7, so I'm aiming to use the v1-compatible parts of Ascii85's API.

@DataWraith
Owner

Thank you for such an outstanding bug report! The script made it really easy to reproduce the problem.

I was too confident in my specs protecting me from this kind of issue, but it seems I have insufficiently tested binary data.

I managed to distill your example down into a short string that triggers the issue: <~S$ojXOT~>. (OU and OT are equivalent because the last bits get chopped off, but the gem produces OT when encoding the data.) Alas, I have now run out of time for today.
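The OU/OT equivalence can be checked by hand. This is a minimal sketch, assuming the usual Ascii85 rule that a short final group of n characters is padded with `u` and decoded to n-1 bytes. `decode_partial_group` is a hypothetical helper written for illustration; it is not part of the gem.

```ruby
# Decode a trailing partial Ascii85 group (2 to 4 characters) by hand.
# Hypothetical helper for illustration only, not the gem's implementation.
def decode_partial_group(chars)
  padded = chars.ljust(5, "u") # pad with 'u', the highest base-85 digit
  # Interpret the five characters as one base-85 number (digit = byte - 33).
  value = padded.each_byte.reduce(0) { |acc, b| acc * 85 + (b - 33) }
  bytes = [value].pack("N")    # 32-bit big-endian, binary-encoded
  bytes[0, chars.length - 1]   # n characters yield n - 1 bytes
end

ot = decode_partial_group("OT")
ou = decode_partial_group("OU")

puts ot == ou # expected: true
```

The two inputs differ only in low-order bits of the 32-bit value, and those bits fall in the bytes that are discarded, so both decode to the same single byte.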

I think the issue can mostly be solved by spamming force_encoding(Encoding::ASCII_8BIT) throughout the code, to make sure that the gem always uses the BINARY encoding instead of the default UTF-8, but that is a rather ugly solution.
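For readers unfamiliar with this class of error, here is a minimal sketch in plain Ruby, without the gem, of how mixing a UTF-8 string with binary data raises Encoding::CompatibilityError and how force_encoding avoids it. The variable names and sample bytes are illustrative assumptions; this is not the gem's actual code.

```ruby
text  = "caf\u00e9"    # UTF-8 string containing a non-ASCII character
bytes = "\x91\x9B".b   # binary (ASCII-8BIT) string with high bytes

# Combining the two fails because neither side is pure ASCII
# and the encodings differ.
error = nil
begin
  text + bytes
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.class

# Re-tagging the UTF-8 string as binary first makes the concatenation legal.
# force_encoding changes only the encoding label, not the underlying bytes.
combined = text.dup.force_encoding(Encoding::ASCII_8BIT) + bytes
puts combined.bytesize
```

The error only appears when both operands contain bytes outside the ASCII range, which may be why inputs that decode to ASCII-only output never trigger it.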

Still, I will shortly push a commit doing just that, and it at least makes the example pass -- but I'm not 100% sure if I managed to catch every instance of the problem.

I probably won't be able to work on this again before Wednesday; I'll try to see if I can uncover more edge cases that can lead to problems then.

@yob
Author

yob commented Sep 3, 2024

Sounds good. There's no urgency from my perspective, released versions of pdf-reader are locked to v1.x so they're continuing to work fine.

I can see a fix has been pushed to main, so I gave it a go (https://github.com/yob/pdf-reader/compare/ascii85-2-0?expand=1). The pdf-reader spec suite is green (some jobs failed, but for unrelated reasons). Here's a passing example on ruby 3.3: https://buildkite.com/yob-opensource/pdf-reader/builds/630#0191b783-766d-4fa6-b7e4-b9583a832f1e

@DataWraith
Owner

Thank you for testing the changes!

I went through the code again on the weekend and made sure that all String literals are unfrozen and encoded as ASCII_8BIT before use; that should take care of the encoding errors.

The new version has also managed to correctly encode and then decode a few gigabytes of random binary data without raising an Exception, so I hope that it works properly now.

Unless something else crops up, I'll probably release version 2.0.1 this weekend.

yob added a commit to yob/pdf-reader that referenced this issue Nov 2, 2024
But not 2.0.0, it has some encoding issues with binary data

DataWraith/ascii85gem#8
@yob
Author

yob commented Nov 2, 2024

Thanks for your help here! I've released pdf-reader with a relaxed Ascii85 dependency and all our tests are green ❤️

@yob yob closed this as completed Nov 2, 2024