Always read data/list.txt as UTF-8 to avoid "ArgumentError: invalid byte sequence in US-ASCII" when parsing it #118

dentarg · 2016-09-19T13:47:13Z

If your environment fails to specify UTF-8, Ruby defaults to US-ASCII, and when public_suffix tries to parse the list data, it fails:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `strip!'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `block (2 levels) in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `each_line'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `block in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:128:in `initialize'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `new'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `parse'
    from (irb):1
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):002:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):003:0> RUBY_VERSION
=> "2.2.5"
irb(main):004:0>

Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
=> nil
irb(main):002:0> RUBY_VERSION
=> "2.2.5"
irb(main):003:0> Encoding.default_external
=> #<Encoding:US-ASCII>
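
To make the difference between the two reads concrete, here is a minimal sketch that does not need the gem. It assumes a temporary file holding one non-ASCII rule, and it passes `encoding: Encoding::US_ASCII` explicitly as a stand-in for what a locale-less environment picks by default. `read_both_ways` is a hypothetical helper, not part of public_suffix.

```ruby
require "tempfile"

# Hypothetical helper, not part of public_suffix: reads the same file twice,
# once tagged as US-ASCII (standing in for the default of a locale-less
# environment) and once tagged explicitly as UTF-8.
def read_both_ways(path)
  {
    us_ascii: File.read(path, encoding: Encoding::US_ASCII),
    utf8: File.read(path, encoding: Encoding::UTF_8)
  }
end

Tempfile.create("list") do |file|
  file.binmode
  file.write("\u7F51\u7EDC.cn\n") # one non-ASCII rule, written as UTF-8 bytes
  file.flush

  result = read_both_ways(file.path)
  puts result[:us_ascii].valid_encoding?               # bytes above 0x7F are not valid US-ASCII
  puts result[:utf8].valid_encoding?                   # the same bytes are valid UTF-8
  puts result[:us_ascii].bytes == result[:utf8].bytes  # only the encoding tag differs
end
```

The bytes on disk are identical in both cases; only the encoding tag attached to the resulting String changes, which is what later string methods check.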

Related to #94 (maybe the list data has changed since?)


weppos · 2016-10-15T12:36:49Z

Thanks @dentarg, I'll investigate. Are you able to tell me which line in the definition file is causing the issue?

dentarg · 2016-10-16T17:31:36Z

@weppos I hope this helps (I'm in a hurry now, so I haven't checked this too closely)

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; nil
=> nil
irb(main):002:0> list_data.class
=> String
irb(main):007:0> ctr = 0 ; outside_line = "" ; list_data.each_line { |line| ctr += 1 ; outside_line = line ; line.strip! } ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):7:in `strip!'
    from (irb):7:in `block in irb_binding'
    from (irb):7:in `each_line'
    from (irb):7
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):008:0> ctr
=> 610
irb(main):009:0> outside_line
=> "\xE5\x85\xAC\xE5\x8F\xB8.cn\n"
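
A way to find the offending line without relying on `strip!` raising is to ask each line whether it is valid in its tagged encoding. This is only a sketch: `first_invalid_line` is a hypothetical helper, and the sample data is a made-up four-line stand-in for the real list, tagged US-ASCII to mimic the session above.

```ruby
# Hypothetical helper: returns [line_number, line] for the first line whose
# bytes are not valid in the string's tagged encoding, or nil if all are fine.
def first_invalid_line(data)
  data.each_line.with_index(1) do |line, number|
    return [number, line] unless line.valid_encoding?
  end
  nil
end

# Made-up sample: UTF-8 bytes tagged as US-ASCII, as in a locale-less read.
sample = "com\nnet\n\u516C\u53F8.cn\norg\n".b.force_encoding(Encoding::US_ASCII)

number, line = first_invalid_line(sample)
puts number
puts line.b.inspect # show the raw bytes of the offending line
```

Run against the real list data, the same helper should report the first rule that contains non-ASCII characters.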

dentarg · 2016-10-16T17:32:25Z

This was with 2.0.3:

irb(main):010:0> PublicSuffix::List::DEFAULT_LIST_PATH
=> "/Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.3/lib/public_suffix/../../data/list.txt"

dentarg · 2016-10-16T22:07:05Z

Hmm... maybe I was naive to believe that everything would be good by File.read with encoding: Encoding::UTF_8 just because it doesn't raise any exception. Seems like "网络.cn\n" is read as "\u7F51\u7EDC.cn\n". This is on OS X 10.11.6, Ruby 2.2.5, zsh 5.0.8, public_suffix-2.0.3. I don't think I fully understand all the LANG, LANGUAGE, LC_* business.

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610]
=> "\u7F51\u7EDC.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610].strip!
=> "\u7F51\u7EDC.cn"
irb(main):004:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "\xE7\xBD\x91\xE7\xBB\x9C.cn\n"
irb(main):005:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):5:in `strip!'
    from (irb):5
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):006:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["", "", "", ""]

$ irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "网络.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
=> "网络.cn"
irb(main):004:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8"]
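
As far as I can tell, the `"\u7F51\u7EDC.cn\n"` output is only a display difference: irb prints results with `String#inspect`, which escapes characters that the output encoding (US-ASCII here) cannot represent. A small sketch, with `hex_bytes` as a hypothetical helper, suggests the escaped and literal forms are the same string:

```ruby
# Hypothetical helper: formats a string's bytes as upper-case hex pairs.
def hex_bytes(string)
  string.bytes.map { |byte| format("%02X", byte) }.join(" ")
end

escaped = "\u7F51\u7EDC.cn" # how irb displayed it under US-ASCII
literal = "网络.cn"          # how irb displayed it under en_US.UTF-8

puts escaped == literal     # true: same characters, same encoding
puts hex_bytes(escaped)     # E7 BD 91 E7 BB 9C 2E 63 6E
puts escaped.valid_encoding?
```

The bytes match the `"\xE7\xBD\x91\xE7\xBB\x9C.cn\n"` shown for the read without `encoding:`, so the explicit UTF-8 read does appear to produce the correct data.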

tamoyal · 2018-09-08T16:55:09Z

I'm having this problem with version 3.0.3

SeanDunford · 2019-04-03T21:45:11Z

Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

weppos · 2019-04-04T08:08:13Z

> Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

It is not dead. If your operating environment is set with the correct UTF8 language value, the library will work perfectly.

aleksandrs-ledovskis · 2019-04-04T11:35:10Z

FWIW, it would seem correct for the gem not to depend on any particular environment setup (i.e. to be agnostic to it) for nominal operation.

weppos · 2019-04-04T12:29:57Z

@SeanDunford @aleksandrs-ledovskis feel free to provide a patch and I will review it. So far, the only one who provided practical help was @dentarg, but even he admitted the problem may not be that easy to solve.

Frankly, I am reluctant to put any effort into trying to make UTF-8 work because the real solution is to pre-process the list and have it stored in Punycode as this is how names should be managed and compared.

It's just not at the top of my priorities right now. PRs are always welcome.

alexef · 2021-02-05T10:16:14Z

This is still broken in 4.0.3 on ruby:2.4-slim-buster docker image.

A workaround is setting: LANG=en_US.UTF-8 LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 before calling ruby.

dentarg · 2021-02-05T12:27:07Z

Looks like LANG=C.UTF-8 is enough; the Docker images for Ruby >= 2.5 set that

$ docker run --rm ruby:2.4-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=2ea0e1a03e36
RUBY_MAJOR=2.4
RUBY_VERSION=2.4.10
RUBY_DOWNLOAD_SHA256=d5668ed11544db034f70aec37d11e157538d639ed0d0a968e2f587191fc530df
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root

vs

$ docker run --rm ruby:2.5-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=7d11ed52a0af
LANG=C.UTF-8
RUBY_MAJOR=2.5
RUBY_VERSION=2.5.8
RUBY_DOWNLOAD_SHA256=0391b2ffad3133e274469f9953ebfd0c9f7c186238968cbdeeb0651aa02a4d6d
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root

Running my initial example

# publicsuffix.rb
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'public_suffix'
end
puts RUBY_VERSION
puts PublicSuffix::List::DEFAULT_LIST_PATH
list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH)
PublicSuffix::List.parse(list_data, private_domains: false)

In ruby:2.4-slim-buster

$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.4-slim-buster bash
root@aa7eb67dce29:/app# gem install bundler
Fetching bundler-2.2.8.gem
Successfully installed bundler-2.2.8
1 gem installed
root@aa7eb67dce29:/app# ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
	from publicsuffix.rb:9:in `<main>'
root@aa7eb67dce29:/app# LANG=C.UTF-8 ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt

In ruby:2.5-slim-buster

$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.5-slim-buster bash
root@b87a1b578bbf:/app# ruby publicsuffix.rb
2.5.8
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt

The problematic code in public_suffix is PublicSuffix::List.default

publicsuffix-ruby/lib/public_suffix/list.rb

Lines 44 to 52 in c4c3012

    
    # Gets the default rule list.
    #
    # Initializes a new {PublicSuffix::List} parsing the content
    # of {PublicSuffix::List.default_list_content}, if required.
    #
    # @return [PublicSuffix::List]
    def self.default(**options)
      @default ||= parse(File.read(DEFAULT_LIST_PATH), **options)
    end

$ docker run --rm -it ruby:2.4-slim-buster bash
root@31cd6631fcaa:/# gem install public_suffix
Fetching public_suffix-4.0.6.gem
Successfully installed public_suffix-4.0.6
1 gem installed
root@31cd6631fcaa:/# ruby -rpublic_suffix -e 'PublicSuffix::List.default'
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:51:in `default'
	from -e:1:in `<main>'
root@31cd6631fcaa:/# LANG=C.UTF-8 ruby -rpublic_suffix -e 'PublicSuffix::List.default'
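
For illustration, this is roughly what an environment-independent read could look like at the `File.read(DEFAULT_LIST_PATH)` call quoted above. It is a sketch under assumptions, not a patch: `ListSketch` is a made-up stand-in for the gem's parser, and the temporary file stands in for data/list.txt.

```ruby
require "tempfile"

# Stand-in for illustration only; this is NOT the gem's parser.
module ListSketch
  # Reads the list as UTF-8 regardless of Encoding.default_external, then
  # keeps only rule lines (the list format uses "//" for comments).
  def self.rules(path)
    File.read(path, encoding: Encoding::UTF_8)
        .each_line
        .map(&:strip)
        .reject { |line| line.empty? || line.start_with?("//") }
  end
end

Tempfile.create("list") do |file|
  file.binmode
  file.write("// comment\ncom\n\n\u7F51\u7EDC.cn\n")
  file.flush

  p ListSketch.rules(file.path)
end
```

Because the encoding is fixed at the read, the string methods that follow see valid UTF-8 whether or not LANG is set.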

zavan · 2021-02-19T14:51:21Z

I'm encountering an error that is probably related to this:

domain = PublicSuffix.domain(request.host)
Tenant.find_by!(domain: domain)

Raises:
ArgumentError (Cannot transliterate strings with ASCII-8BIT encoding)

Forcing UTF-8 works:

domain = PublicSuffix.domain(host).to_s.force_encoding('UTF-8')

Ruby: 3.0.0
Rails: 6.1.3
Gem: 4.0.6
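
A note on what `force_encoding` does in that workaround: it only changes the encoding tag on the string and leaves the bytes untouched. The sketch below uses a binary-tagged literal as a stand-in for the value returned by the gem (it does not reproduce the Rails error itself), and `as_utf8` is a hypothetical helper.

```ruby
# Hypothetical helper: returns a copy of the value tagged as UTF-8, leaving
# the original string untouched. No bytes are converted, only the tag changes.
def as_utf8(value)
  value.to_s.dup.force_encoding(Encoding::UTF_8)
end

binary = "example.com".b # stand-in for a string tagged ASCII-8BIT
utf8 = as_utf8(binary)

puts binary.encoding
puts utf8.encoding
puts utf8.valid_encoding?
puts binary.bytes == utf8.bytes
```

Retagging is only safe when the bytes really are UTF-8, so checking `valid_encoding?` afterwards is a cheap sanity test.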

mcarpenter · 2024-05-03T10:39:31Z

Two workarounds below.

Set the encoding using the Ruby interpreter's -E flag:

ruby -E utf-8 ./foo.rb

Set the external encoding programmatically:

require 'public_suffix'

Encoding.default_external = 'utf-8'
puts PublicSuffix.parse('example.com').inspect
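
The second workaround relies on `File.read` consulting `Encoding.default_external` at the time of the call when no `encoding:` is given. A small sketch of that behaviour, using a temporary file instead of the gem's list; `with_default_external` is a hypothetical helper that restores the previous value afterwards.

```ruby
require "tempfile"

# Hypothetical helper: runs the block with Encoding.default_external
# temporarily changed, then restores the previous value.
def with_default_external(encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = previous
end

Tempfile.create("list") do |file|
  file.binmode
  file.write("\u7F51\u7EDC.cn\n")
  file.flush

  %w[US-ASCII UTF-8].each do |name|
    data = with_default_external(name) { File.read(file.path) }
    puts "#{name}: #{data.encoding}, valid=#{data.valid_encoding?}"
  end
end
```

Note that the setting is process-wide, so it affects every other read in the program that does not pass an explicit encoding.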

dentarg mentioned this issue Sep 19, 2016

"ArgumentError: invalid byte sequence in US-ASCII" when parsing the public suffix list twingly/twingly-url#98

Closed

weppos self-assigned this Oct 15, 2016

weppos added the bug label Oct 15, 2016

weppos removed their assignment Mar 6, 2017

dentarg mentioned this issue Apr 1, 2023

Encoding is changed through normalisation sporkmonger/addressable#100

Closed
