Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support mlb option "allowExtendedTextConsts true" #49

Closed
UltimatePea opened this issue Apr 11, 2022 · 4 comments · Fixed by #74
Closed

Support mlb option "allowExtendedTextConsts true" #49

UltimatePea opened this issue Apr 11, 2022 · 4 comments · Fixed by #74

Comments

@UltimatePea
Copy link

Currently, smlfmt will report an error on non-ascii input.

Example file:

val a = "🍰"

Error message:

-- SYNTAX ERROR ----------------------------------------------------------------

Invalid character.

test.sml
  | 
1 | val a = "🍰"
  |          ^

Strings can only contain printable (visible or whitespace) ASCII characters.

Expected behavior

Strings need to handle UTF8 non-ascii characters.

@shwestrick
Copy link
Owner

Supporting this won't be too bad, but will require changes in a few places.

For the lexer, we'll need to skip over UTF8 characters in the function advance_oneCharOrEscapeSequenceInString. Note that this function already skips over escape sequences; handling UTF8 should be similar. And then we can selectively enable this functionality by adding an additional flag to the lexer functions Lexer.next and Lexer.tokens.

We'll need to update the implementation of Source, too, such as Source.absoluteStart which returns the position (line and col) of a source file segment. Currently these are computed via byte offsets, which is no longer correct under UTF8. I believe other functions will need to be updated, too, to ensure that a Source.t never starts or ends in the middle of a UTF8 sequence.

@shwestrick
Copy link
Owner

By the way, what is the accepted standard practice these days for visually handling "characters" that are encoded as more than one UTF8 character? E.g., the flag emoji "🇺🇸" is actually two UTF8 characters ("🇺" followed by "🇸"). But of course, it is intended to be visually represented as a single character.

My initial thought is that this is important for smlfmt because we need to know positions to vertically align things correctly. Do we use the UTF8 semantic position, or the intended visual position? I'm inclined to use UTF8 semantic position...

@UltimatePea
Copy link
Author

Thanks for the info!

I am not very familiar with UTF8/Unicode, but I would suggest we at least fix the lexer to not produce an error when encountering a UTF8 character.

I am not so familiar with the difference between semantic position and visual position, so I vote for whatever is easier to implement, which is probably UTF8 semantic position.

@shwestrick
Copy link
Owner

shwestrick commented Jan 9, 2023

It occurred to me that a simpler way to support this is to allow for UTF-8 bytes but not check for validity of a UTF-8 byte sequence. #74 implements this.

By default, this is disabled. It can be enabled with -allow-extended-text-consts true at the command-line, or with the "allowExtendedTextConsts true" annotation within an MLB.

Your example above should now be working. Let me know if you have any trouble!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants