Support mlb option "allowExtendedTextConsts true" #49

UltimatePea · 2022-04-11T19:40:49Z

Currently, smlfmt reports a syntax error on non-ASCII input.

Example file:

val a = "🍰"

Error message:

-- SYNTAX ERROR ----------------------------------------------------------------

Invalid character.

test.sml
  | 
1 | val a = "🍰"
  |          ^

Strings can only contain printable (visible or whitespace) ASCII characters.

Expected behavior

String literals should accept non-ASCII UTF-8 characters.

shwestrick · 2022-04-11T20:38:02Z

Supporting this won't be too bad, but will require changes in a few places.

For the lexer, we'll need to skip over UTF8 characters in the function advance_oneCharOrEscapeSequenceInString. Note that this function already skips over escape sequences; handling UTF8 should be similar. And then we can selectively enable this functionality by adding an additional flag to the lexer functions Lexer.next and Lexer.tokens.
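As a rough illustration of what "skipping over a UTF8 character" could look like, here is a minimal sketch. The names utf8SeqLen and advanceUtf8 are hypothetical and are not taken from the smlfmt sources; the sketch only checks the lead byte and the continuation bytes, so it does not reject every ill-formed sequence.

```sml
(* Hypothetical helper, not part of smlfmt: number of bytes in the UTF-8
 * sequence that begins with lead byte c, or NONE if c cannot start one. *)
fun utf8SeqLen (c: char) : int option =
  let
    val b = Char.ord c
  in
    if b < 0x80 then SOME 1        (* 0xxxxxxx: plain ASCII *)
    else if b < 0xC2 then NONE     (* continuation byte or overlong lead *)
    else if b < 0xE0 then SOME 2   (* 110xxxxx *)
    else if b < 0xF0 then SOME 3   (* 1110xxxx *)
    else if b < 0xF5 then SOME 4   (* 11110xxx, up to U+10FFFF *)
    else NONE
  end

(* Hypothetical helper: index just past the character that starts at byte i
 * of s, or NONE if the sequence is truncated or has a bad continuation byte. *)
fun advanceUtf8 (s: string, i: int) : int option =
  let
    fun isCont j =
      let val b = Char.ord (String.sub (s, j))
      in b >= 0x80 andalso b < 0xC0
      end
    fun allCont (j, stop) =
      j >= stop orelse (isCont j andalso allCont (j + 1, stop))
  in
    if i < 0 orelse i >= String.size s then NONE
    else
      case utf8SeqLen (String.sub (s, i)) of
        NONE => NONE
      | SOME n =>
          if i + n <= String.size s andalso allCont (i + 1, i + n)
          then SOME (i + n)
          else NONE
  end

(* "\240\159\141\176" is the cake emoji from the example file, written as
 * decimal byte escapes so that any SML lexer accepts this snippet. *)
val afterCake = advanceUtf8 ("\"\240\159\141\176\"", 1)   (* SOME 5 *)
```

A lexer could call something like this wherever it currently advances by one byte inside a string literal, and fall back to the existing error when the result is NONE.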

We'll need to update the implementation of Source, too, such as Source.absoluteStart which returns the position (line and col) of a source file segment. Currently these are computed via byte offsets, which is no longer correct under UTF8. I believe other functions will need to be updated, too, to ensure that a Source.t never starts or ends in the middle of a UTF8 sequence.
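To show why byte offsets and columns diverge, here is a small sketch that converts a byte offset within a line into a column counted in code points. The function columnOf is hypothetical (it is not Source.absoluteStart or any other smlfmt function), and 1-based columns are an assumption made for the example.

```sml
(* UTF-8 continuation bytes have the form 10xxxxxx. *)
fun isContinuation (c: char) : bool =
  let val b = Char.ord c
  in b >= 0x80 andalso b < 0xC0
  end

(* Hypothetical helper: 1-based column of byte offset `off` in `line`,
 * counting code points (every non-continuation byte) rather than bytes. *)
fun columnOf (line: string, off: int) : int =
  let
    fun loop (i, n) =
      if i >= off orelse i >= String.size line then n
      else
        loop (i + 1, if isContinuation (String.sub (line, i)) then n else n + 1)
  in
    loop (0, 1)
  end

(* The example line from this issue, with the emoji as byte escapes. *)
val line = "val a = \"\240\159\141\176\""

(* The closing quote sits at byte offset 13: a byte-based column would be 14,
 * but only 10 code points precede it. *)
val col = columnOf (line, 13)   (* 11 *)
```

The same counting idea would also need to apply when a source segment is split, so that no segment boundary falls on a continuation byte.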

shwestrick · 2022-04-11T20:55:33Z

By the way, what is the accepted standard practice these days for visually handling "characters" that are made up of more than one Unicode code point? E.g., the flag emoji "🇺🇸" is actually two code points ("🇺" followed by "🇸"), each encoded as four UTF-8 bytes. But of course, it is intended to be visually represented as a single character.

My initial thought is that this is important for smlfmt because we need to know positions to vertically align things correctly. Do we use the UTF8 semantic position, or the intended visual position? I'm inclined to use UTF8 semantic position...
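To make the distinction concrete: the flag is 8 bytes of UTF-8 and 2 Unicode code points, yet it is displayed as a single glyph. The sketch below measures the first two quantities; codePoints is a hypothetical helper, and counting what is actually displayed (grapheme clusters) would need the Unicode segmentation rules, which this sketch does not attempt.

```sml
(* U+1F1FA followed by U+1F1F8 (regional indicators U and S), written as
 * decimal byte escapes. *)
val flag = "\240\159\135\186\240\159\135\184"

(* Hypothetical helper: count code points by counting the bytes that are
 * not UTF-8 continuation bytes (10xxxxxx). Assumes well-formed input. *)
fun codePoints (s: string) : int =
  List.length
    (List.filter
       (fn c => let val b = Char.ord c in b < 0x80 orelse b >= 0xC0 end)
       (String.explode s))

val nBytes = String.size flag    (* 8 *)
val nPoints = codePoints flag    (* 2 *)
```

Aligning by code points would therefore treat the flag as two columns wide, while most terminals and editors render it in the space of one wide glyph.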

UltimatePea · 2022-04-11T21:18:55Z

Thanks for the info!

I am not very familiar with UTF8/Unicode, but I would suggest we at least fix the lexer to not produce an error when encountering a UTF8 character.

I am not so familiar with the difference between semantic position and visual position, so I vote for whatever is easier to implement, which is probably UTF8 semantic position.

shwestrick · 2023-01-09T18:31:29Z

It occurred to me that a simpler way to support this is to allow for UTF-8 bytes but not check for validity of a UTF-8 byte sequence. #74 implements this.

By default, this is disabled. It can be enabled with -allow-extended-text-consts true at the command-line, or with the "allowExtendedTextConsts true" annotation within an MLB.
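For reference, an MLB file using the annotation might look like the fragment below. This follows the general MLton-style ann ... in ... end form; the file name test.sml is the example file from this issue, and the surrounding MLB file is hypothetical.

```
ann "allowExtendedTextConsts true" in
  test.sml
end
```

The annotation applies only to the files listed between in and end, so other sources in the same MLB keep the default ASCII-only behavior.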

Your example above should now be working. Let me know if you have any trouble!

shwestrick mentioned this issue Jan 9, 2023: SuccessorML extended text constants #74 (merged)

shwestrick closed this as completed in #74 Jan 9, 2023
