[Ruby Lexer bug] An emoji cannot start a name expression #2009
Comments
Currently symbol names in this syntax have to start with a-z: rouge/lib/rouge/lexers/ruby.rb Line 86 in 1687d63
Happy to switch this to a unicode property ( |
The emoji does appear to work in the Ruby console though. |
Also while I'm here this line should probably be non-greedy, even if it's protected by never matching unescaped rouge/lib/rouge/lexers/ruby.rb Line 87 in 1687d63
|
It seems upper-case letters also work. My guess is that it is the same as the rules for a variable or a constant. Does |
Would this MR help to address this issue #1894? |
No. It fixes things like |
That is curious, the regexp really should be case-insensitive as well. |
/\p{Word}/i =~ '啊' # => 0
/\p{Word}/i =~ '🎃' # => nil |
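To make the difference above concrete, here is a minimal sketch that checks a few sample characters against `\p{Word}` and against the general category `\p{So}` (Symbol, other), which is where the pumpkin emoji sits. The `classify` helper is hypothetical and not part of Rouge; per Ruby's Regexp documentation, `\p{Word}` covers the Letter, Mark, Number, and Connector_Punctuation categories, so symbols are not expected to match.

```ruby
# Hypothetical helper (not part of Rouge): report whether a single
# character is matched by \p{Word} and by the symbol category \p{So}.
def classify(ch)
  {
    word: ch.match?(/\p{Word}/),
    symbol_other: ch.match?(/\p{So}/)
  }
end

# Print the classification for an ASCII letter, an underscore,
# a CJK letter, and an emoji.
['a', '_', '啊', '🎃'].each do |ch|
  puts "#{ch}: #{classify(ch)}"
end
```

This suggests that a rule built only on `\p{Word}` would still reject emoji at the start of a name.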
Matching emojis seems to be tricky as they can't be captured in a regex range. I believe this gem https://github.com/ticky/ruby-emoji-regex might have the regex to solve it 🤔
|
That does not solve the issue either, because non-word characters other than emojis can also start a name, such as the Chinese period:
/\p{Word}/i =~ '。' # => nil
。=1 # no SyntaxError
I think any non-ASCII character can do it? So what we really need to use is |
I'm a little nervous about significantly expanding the name rules without some assurance from Ruby folks about what it is exactly they intend to support. Does ruby have a spec for this? |
There should be a spec in ISO/IEC 30170:2012, but I didn't buy it. The documentation on ruby-doc.org says any character with the eighth bit set can start a variable, though, so |
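As a rough illustration of the "eighth bit set" rule, the sketch below treats any non-ASCII character as a valid name start, alongside ASCII letters and the underscore. Both helpers are hypothetical and are not the pattern Rouge ships; the ASCII part of the character class is an assumption based on the variable and constant rules mentioned earlier in the thread. In UTF-8, every byte of a non-ASCII character has its high bit set, which is what `high_bit_bytes?` checks.

```ruby
# Hypothetical check, NOT the rule Rouge ships: a name may start with an
# ASCII letter, an underscore, or any non-ASCII character.
def name_start?(ch)
  ch.match?(/\A(?:[A-Za-z_]|[^\x00-\x7F])/)
end

# In UTF-8, every byte of a non-ASCII character has its eighth bit set.
def high_bit_bytes?(ch)
  ch.bytes.all? { |b| b >= 0x80 }
end

# Print both checks for letters, CJK, emoji, punctuation, and a digit.
['a', 'Z', '_', '啊', '🎃', '。', '1', '-'].each do |ch|
  puts "#{ch}: start=#{name_start?(ch)} high_bit=#{high_bit_bytes?(ch)}"
end
```

Under these assumptions the emoji and the Chinese period are both accepted as name starts, while the ASCII digit and hyphen are not.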
It seems that ISO/IEC 30170:2012 does not talk about non-ASCII characters, so strictly speaking, using non-ASCII characters as identifiers is undefined behavior. It just happens to be implemented and documented in CRuby and maybe other Ruby implementations. I do not know what the philosophy of Rouge is, but I support adopting |
Looks like at least one other implementation agrees: |
Name of the lexer
Ruby
Code sample
https://rouge.jneen.net/v4.2.0/ruby/e_CfjoM6MX0
Additional context
Output:
The class of the 🎃 character is err, which is not correct (should be n).