Check for valid utf-8 in inputstream.rb gives false negatives when $KCODE is set to "UTF8" [w/fix] #66

Open
GoogleCodeExporter opened this issue Mar 22, 2015 · 1 comment

Comments

@GoogleCodeExporter

What steps will reproduce the problem?

Parse a string containing certain Unicode characters, such as ［ (U+FF3B FULLWIDTH LEFT SQUARE BRACKET, not to be confused with the ASCII [). For example, run this program:

require 'html5'
include HTML5
t="test\357\274\273\343\201\202\357\274\275\n"
$KCODE="UTF8"
print HTMLParser.parse_fragment(t,{:encoding => 'utf-8'})
$KCODE="NONE"
print HTMLParser.parse_fragment(t,{:encoding => 'utf-8'})
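
For readers who want to check which characters the octal escapes in the test string stand for, the bytes can be decoded without involving the parser at all. This is only an illustrative sketch, not part of the original report; the variable names `code_points` and `non_ascii` are made up for the example.

```ruby
# Same bytes as the test string in the report, written with octal escapes.
t = "test\357\274\273\343\201\202\357\274\275\n"

# unpack('U*') decodes the bytes as UTF-8 and returns the code points.
code_points = t.unpack('U*')

# Keep only the non-ASCII code points, i.e. the three multi-byte characters.
non_ascii = code_points.select { |cp| cp > 0x7F }

# Prints U+FF3B, U+3042 and U+FF3D, one per line:
# the fullwidth left bracket, HIRAGANA LETTER A, and the fullwidth right bracket.
non_ascii.each { |cp| puts format('U+%04X', cp) }
```

All three characters are well-formed three-byte sequences, so a correct validity check should pass every one of them through unchanged.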


What is the expected output? What do you see instead?

Expected output:
test［あ］
test［あ］

Actual output:
test���あ���
test［あ］


Please provide any additional information below.

Some Ruby applications run with $KCODE set to UTF8; notably, this is the default for Ruby on Rails applications. One effect of this setting is that regular expressions are Unicode-aware by default (i.e., /a/ acts like /a/u). inputstream.rb uses a regular expression to check for valid UTF-8:

        when 0xC0..0xFF
          if instance_variables.include?("@win1252") && @win1252
            "\xC3" + (c - 64).chr # convert to utf-8
          # from http://www.w3.org/International/questions/qa-forms-utf-8.en.php
          elsif @buffer[@tell - 1..@tell + 3] =~ /^
                ( [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
                |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
                | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
                |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
                |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
                | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
                |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
                )/x
            @tell += $1.length - 1
            $1
          else
            [0xFFFD].pack('U') # invalid utf-8
          end

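When $KCODE is set to UTF8, the expression fails to recognize the UTF-8 representation of ［ as valid, because the byte escapes in the pattern are no longer matched against individual bytes. The problem can be solved by adding the "n" option at the end of the expression, which makes the pattern byte-oriented regardless of $KCODE. For example: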

irb(main):004:0> $KCODE='UTF8'
=> "UTF8"
irb(main):005:0> "\357\274\273" =~ /^( [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )/x
=> nil
irb(main):006:0> "\357\274\273" =~ /^( [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )/xn
=> 0
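
As a rough sketch of the same byte-oriented check on Ruby 2.0 or later, where each string carries its own encoding, the pattern can be compiled with the "n" option and matched against a binary copy of the input. This is an illustration under those assumptions, not the actual html5lib patch: the constant `UTF8_SEQUENCE` and the helper `leading_utf8_sequence` are hypothetical names, and `\A` is used in place of `^` to anchor at the start of the string.

```ruby
# Byte-oriented pattern for one well-formed multi-byte UTF-8 sequence.
# The "n" option keeps the \x escapes byte-wise instead of character-wise.
UTF8_SEQUENCE = /\A
  ( [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )/xn

# Hypothetical helper: returns the leading multi-byte sequence as a binary
# string, or nil if the input does not start with a well-formed one.
def leading_utf8_sequence(bytes)
  # String#b returns a binary (ASCII-8BIT) copy, compatible with the /n regexp.
  match = UTF8_SEQUENCE.match(bytes.b)
  match && match[1]
end

seq = leading_utf8_sequence("\357\274\273abc")
puts seq.bytesize                                  # prints 3
puts leading_utf8_sequence("\xC0\x80").inspect     # prints nil (overlong form)
```

Matching on a binary copy sidesteps any character-level interpretation of the pattern, which is the same effect the "n" option has under $KCODE in the irb session above.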


(I blame the lack of a preview button for any errors in this submission ;-) )

Original issue reported on code.google.com by [email protected] on 27 Apr 2008 at 1:22

@GoogleCodeExporter
Author

Original comment by [email protected] on 8 Jun 2008 at 9:33

  • Added labels: Port-Ruby
