[Ruby Lexer bug] An emoji cannot start a name expression #2009

UlyssesZh · 2023-11-05T07:39:10Z

Name of the lexer
Ruby

Code sample

{🎃:1}

https://rouge.jneen.net/v4.2.0/ruby/e_CfjoM6MX0

Additional context

require 'rouge'
Rouge::Formatters::HTML.new.format Rouge::Lexers::Ruby.lex '{🎃:1}'

Output:

<span class="p">{</span><span class="err">🎃</span><span class="p">:</span><span class="mi">1</span><span class="p">}</span>

The class of the 🎃 character is err, which is not correct (should be n).

jneen · 2023-11-05T13:40:25Z

Currently symbol names in this syntax have to start with a-z:

rouge/lib/rouge/lexers/ruby.rb

Line 86 in 1687d63

rule %r/\b[a-z_]\w*?[?!]?:\s+/, Str::Symbol, :expr_start

Happy to switch this to a unicode property (\p{...}) if you can find documentation of what it's supposed to be.
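To see why the pumpkin falls through, the quoted rule can be tried outside Rouge. This is a minimal sketch in plain Ruby: the pattern is copied from the line quoted above, while the names SYMBOL_KEY and symbol_key? are made up for illustration. Note that the rule also requires whitespace after the colon, so the samples are written with a space.

```ruby
# Pattern copied from the quoted rule (line 86 in 1687d63).
# SYMBOL_KEY and symbol_key? are hypothetical names for this sketch only.
SYMBOL_KEY = %r/\b[a-z_]\w*?[?!]?:\s+/

def symbol_key?(src)
  SYMBOL_KEY.match?(src)
end

p symbol_key?('{foo: 1}')  # => true
p symbol_key?('{🎃: 1}')   # => false, nothing in the input is in [a-z_]
```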

jneen · 2023-11-05T13:41:14Z

The emoji does appear to work in the Ruby console though.

jneen · 2023-11-05T13:44:29Z

Also while I'm here this line should probably be non-greedy, even if it's protected by never matching unescaped ':

rouge/lib/rouge/lexers/ruby.rb

Line 87 in 1687d63

rule %r/'(\\\\|\\'|[^'])*'/, Str::Single

UlyssesZh · 2023-11-05T17:41:05Z

> Currently symbol names in this syntax have to start with a-z:
>
> rouge/lib/rouge/lexers/ruby.rb
>
> Line 86 in 1687d63
>
> rule %r/\b[a-z_]\w*?[?!]?:\s+/, Str::Symbol, :expr_start
>
> Happy to switch this to a unicode property (\p{...}) if you can find documentation of what it's supposed to be.

It seems upper-case letters also work. My guess is that it is the same as the rules for a variable or a constant.

Does \w match non-ASCII characters? If not, the \w*? used to match the later characters is probably not right either.
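On the \w question: as far as I know, Ruby's Regexp documentation defines \w as [a-zA-Z0-9_], so without an explicit Unicode option it does not match non-ASCII characters. A quick check in plain Ruby (word_char? is a made-up helper, not part of Rouge):

```ruby
# word_char? is a hypothetical helper for this sketch.
# It asks whether a single character is matched by \w.
def word_char?(ch)
  /\A\w\z/.match?(ch)
end

p word_char?('a')   # => true
p word_char?('啊')  # => false
p word_char?('🎃')  # => false
```

If that holds, the \w*? tail of the rule would stop at the first non-ASCII character even once the first-character class is widened.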

tancnle · 2023-11-05T22:57:23Z

Would this MR help to address this issue #1894?

UlyssesZh · 2023-11-05T23:08:04Z

> Would this MR help to address this issue #1894?

No. It fixes things like 啊=1 and {啊:1}, but not 🎃=1 or {🎃:1}. Seems like the new regexp rules still do not cover emojis.

jneen · 2023-11-07T17:08:45Z

That is curious, the regexp really should be case-insensitive as well.

UlyssesZh · 2023-11-07T17:20:23Z

> That is curious, the regexp really should be case-insensitive as well.

/\p{Word}/i =~ '啊' # => 0
/\p{Word}/i =~ '🎃' # => nil

tancnle · 2023-11-08T00:01:13Z

Matching emojis seems to be tricky as they can't be captured in a regex range. I believe this gem https://github.com/ticky/ruby-emoji-regex might have the regex to solve it 🤔

[1] pry(main)> require 'emoji_regex'
[2] pry(main)> EmojiRegex::Regex.match('🎃')
=> #<MatchData "🎃">

UlyssesZh · 2023-11-08T00:11:02Z

> Matching emojis seems to be tricky as they can't be captured in a regex range. I believe this gem https://github.com/ticky/ruby-emoji-regex might have the regex to solve it 🤔
>
> [1] pry(main)> require 'emoji_regex'
> [2] pry(main)> EmojiRegex::Regex.match('🎃')
> => #<MatchData "🎃">

That does not solve the issue either because there are non-word characters other than emojis that can start a name, such as the Chinese period:

/\p{Word}/i =~ '。' # => nil
。=1 # no SyntaxError

I think any non-ASCII character can do it? So what we really need is [a-z_]|[^\x00-\x7F] (for variables) and [a-zA-Z_]|[^\x00-\x7F] (for Hash symbol keys).

jneen · 2023-11-08T00:58:52Z

I'm a little nervous about significantly expanding the name rules without some assurance from Ruby folks about what it is exactly they intend to support. Does ruby have a spec for this?

UlyssesZh · 2023-11-08T01:23:54Z

> I'm a little nervous about significantly expanding the name rules without some assurance from Ruby folks about what it is exactly they intend to support. Does ruby have a spec for this?

There should be a spec in ISO/IEC 30170:2012, but I didn't buy it. The documentation on ruby-doc.org says any character with the eighth bit set can start a variable, though, so [^\x00-\x7F] should be it.

jneen · 2023-11-08T05:14:29Z

Looks right to me. I didn't want to believe it but it looks like they *actually* implemented it this way:

; ruby -e 'p 　:1'
{:　=>1}

This is a double-width (ideographic) space, U+3000, also known as \xe3\x80\x80 in UTF-8, and it can be used as an identifier 🤦

As long as there aren't any issues with that regexp leaving glyphs stranded, we should be able to support this somewhat. Basic testing with the Japanese block seems to indicate that [^\x00-\x7F] will work.
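For concreteness, the [^\x00-\x7F] idea could be sketched as below. The pattern and the names IDENTIFIER and identifier? are made up for illustration and are not the rule Rouge uses. The sketch only assumes, per the ruby-doc.org description mentioned earlier, that any non-ASCII character may start or continue a name.

```ruby
# Hypothetical pattern for illustration only; not taken from Rouge.
# Start: an ASCII letter, underscore, or any non-ASCII character.
# Rest:  ASCII word characters or any non-ASCII character.
IDENTIFIER = /\A(?:[a-zA-Z_]|[^\x00-\x7F])(?:\w|[^\x00-\x7F])*\z/

def identifier?(src)
  IDENTIFIER.match?(src)
end

p identifier?('foo')   # => true
p identifier?('啊')    # => true
p identifier?('🎃')    # => true
p identifier?('。')    # => true
p identifier?('1abc')  # => false

# On a UTF-8 string the class consumes a whole character, not a single byte.
p '🎃=1'[/[^\x00-\x7F]/] == '🎃'  # => true
```

The last line is a small check on the "stranded glyphs" concern: with UTF-8 input the negated class matches per character, so a multi-byte character is not split.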

UlyssesZh · 2023-11-08T06:51:41Z

It seems that ISO/IEC 30170:2012 does not talk about non-ASCII characters, so strictly speaking using non-ASCII characters as identifiers is undefined behavior. It just happens to be implemented and documented in CRuby and maybe other Ruby implementations.

I do not know what Rouge's philosophy is, but I support adopting [^\x00-\x7F]. I have read some of the ISO documentation, and I must say it gives far too much freedom to implementations of Ruby interpreters. CRuby should be used as the de facto standard instead (and it actually is, when people write new implementations), and any Ruby highlighting program should also use CRuby as the standard.

jneen · 2023-11-08T16:10:17Z

Looks like at least one other implementation agrees:

https://github.com/lib-ruby-parser/lib-ruby-parser/blob/134edd54bac26dee604e27ea6d5537c20b265646/src/source/buffer.rs#L316

UlyssesZh added the bugfix-request A request for a bugfix to be developed. label Nov 5, 2023

UlyssesZh changed the title ~~[Ruby Lexer bug] Hash symbol key is sometimes regarded as error~~ [Ruby Lexer bug] An emoji cannot start a name expression Nov 5, 2023
