Non-ASCII directive names #23
Comments
Here I shared my specific problem. I don't object to the current implementation with the imposed limitations backed up with solid reasoning in the readme about spacing and trailing colons. I would love to understand the rationale behind limiting the directive naming.
@wooorm may be able to offer more context.
The reason the current state is the way it is, is so that I didn’t have to decide. Custom elements look like a good thing to be compatible with. Although I don’t think a) the
I wonder whether we need to enforce the disallowed ASCII punctuation/symbols though. I can see
Maybe simplest is to allow all Unicode characters that are not Unicode whitespace? https://github.com/micromark/micromark/blob/929275e2ccdfc8fd54adb1e1da611020600cc951/packages/micromark-util-character/dev/index.js#L232
@wooorm and @ChristianMurphy, thank you for sharing the details. I also have assumed
Thinking of a potential solution, the character ranges listed in the HTML standard for custom element names seem reasonable to me. PCENChar (potential custom element name character) is quite wide; it seems to allow all "alphabets", including the characters needed in my case.
Yet, it goes beyond the proposed simplest solution and still enforces some limits. What do you think?
Script I used to preview ranges
I am not knowledgeable about Unicode character ranges, so I asked ChatGPT what the range numbers mean (extended Latin, Japanese, Greek, Cyrillic, etc.) and reviewed the list manually using a script.
// Collects every previewed character; the ASCII entries below are listed for reference only.
const chars = []
// "-"
// "."
// [0-9]
// "_"
// [a-z]
chars.push(String.fromCharCode(0xB7))
for (let i = 0xC0; i <= 0xD6; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xD8; i <= 0xF6; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xF8; i <= 0x37D; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x37F; i <= 0x1FFF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x200C; i <= 0x200D; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x203F; i <= 0x2040; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x2070; i <= 0x218F; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x2C00; i <= 0x2FEF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x3001; i <= 0xD7FF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xF900; i <= 0xFDCF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xFDF0; i <= 0xFFFD; ++i) chars.push(String.fromCharCode(i))
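// Code points above 0xFFFF need String.fromCodePoint; String.fromCharCode truncates its argument to 16 bits.
for (let i = 0x10000; i <= 0xEFFFF; ++i) chars.push(String.fromCodePoint(i))
console.log(chars.join('\n'))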
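As a complement to the preview script, here is a minimal sketch of how the same ranges could be used to test a candidate name rather than print roughly 900,000 characters. The ranges are copied from the script above; `pcenRanges`, `isPcenChar`, and `isPcenName` are hypothetical names for illustration, not part of micromark or the HTML standard.

```javascript
// Inclusive [start, end] code point ranges, copied from the preview script.
const pcenRanges = [
  [0x2d, 0x2d], // "-"
  [0x2e, 0x2e], // "."
  [0x30, 0x39], // [0-9]
  [0x5f, 0x5f], // "_"
  [0x61, 0x7a], // [a-z]
  [0xb7, 0xb7],
  [0xc0, 0xd6],
  [0xd8, 0xf6],
  [0xf8, 0x37d],
  [0x37f, 0x1fff],
  [0x200c, 0x200d],
  [0x203f, 0x2040],
  [0x2070, 0x218f],
  [0x2c00, 0x2fef],
  [0x3001, 0xd7ff],
  [0xf900, 0xfdcf],
  [0xfdf0, 0xfffd],
  [0x10000, 0xeffff]
]

// Hypothetical helper: is this code point inside any listed range?
function isPcenChar(codePoint) {
  return pcenRanges.some(([start, end]) => codePoint >= start && codePoint <= end)
}

// Hypothetical helper: does every code point of a non-empty name pass?
function isPcenName(name) {
  return (
    name.length > 0 &&
    Array.from(name).every((char) => isPcenChar(char.codePointAt(0)))
  )
}

console.log(isPcenName('приспів')) // true
console.log(isPcenName('chorus')) // true
console.log(isPcenName('Chorus')) // false: uppercase ASCII is not in the ranges
console.log(isPcenName('my name')) // false: space is not in the ranges
```

Note that these ranges exclude uppercase ASCII letters, so adopting them as-is would be stricter than the current ASCII rule in that respect.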
Some more considerations:
Custom elements allow basically all higher-than-ASCII punctuation, and in the ASCII range
So I’d prefer starting with a few ASCII punctuation characters; we can expand later:
@wooorm do you have I have found that This may be expanded to:
export const unicodeAlphanumeric = regexCheck(/[\p{L}\p{N}]/u)
If we come to an agreement, I could prepare a pull request. What do you think?
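To see what such a pattern would accept, here is a small sketch that applies the same character class as a plain `RegExp`, without micromark's `regexCheck` wrapper. The helper name `isUnicodeAlphanumericName` is hypothetical and exists only for this illustration.

```javascript
// Same character class as in the comment above, used as a plain RegExp.
const unicodeAlphanumeric = /[\p{L}\p{N}]/u

// Hypothetical helper: true when every code point of a non-empty name
// is a Unicode letter (\p{L}) or number (\p{N}).
function isUnicodeAlphanumericName(name) {
  return (
    name.length > 0 &&
    Array.from(name).every((char) => unicodeAlphanumeric.test(char))
  )
}

console.log(isUnicodeAlphanumericName('приспів')) // true
console.log(isUnicodeAlphanumericName('chorus2')) // true
console.log(isUnicodeAlphanumericName('my-chorus')) // false: "-" is punctuation
// "हिंदी" written with escapes; it contains combining marks (\p{M}).
console.log(isUnicodeAlphanumericName('\u0939\u093F\u0902\u0926\u0940')) // false
```

Two things stand out in this sketch: `-` and `_` would need to be allowed separately, and scripts that rely on combining marks would be rejected unless `\p{M}` is added to the class.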
We already have the parts in micromark. I think this is fine:
const fine = code <= codes.del
  ? code === codes.dash ||
    code === codes.dot ||
    code === codes.underscore ||
    asciiAlphanumeric(code)
  : classifyCharacter(code) !== constants.characterGroupWhitespace
Using
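For readers without the micromark internals at hand, the expression above can be approximated in a standalone sketch. Everything here is a stand-in: `isWhitespaceCode` assumes JavaScript's `\s` is close enough to micromark's whitespace group, `0x7f` assumes `codes.del` is ASCII DEL (127), and `isNameCode` and `isName` are hypothetical names.

```javascript
// Stand-in for micromark's `asciiAlphanumeric`.
function asciiAlphanumeric(code) {
  return /[0-9A-Za-z]/.test(String.fromCharCode(code))
}

// Stand-in for `classifyCharacter(code) === constants.characterGroupWhitespace`.
// Assumption: JavaScript's \s approximates that group for non-ASCII input.
function isWhitespaceCode(code) {
  return /\s/.test(String.fromCodePoint(code))
}

// Mirrors the ternary above: strict inside ASCII, permissive outside it.
function isNameCode(code) {
  return code <= 0x7f
    ? code === 0x2d || // "-"
        code === 0x2e || // "."
        code === 0x5f || // "_"
        asciiAlphanumeric(code)
    : !isWhitespaceCode(code)
}

// Hypothetical helper: check a whole non-empty name, code point by code point.
function isName(name) {
  return (
    name.length > 0 &&
    Array.from(name).every((char) => isNameCode(char.codePointAt(0)))
  )
}

console.log(isName('приспів')) // true
console.log(isName('chorus_1')) // true
console.log(isName('a!b')) // false: other ASCII punctuation is rejected
console.log(isName('a\u00A0b')) // false: no-break space is whitespace
console.log(isName('«приспів»')) // true: non-ASCII punctuation is allowed
```

The last line shows the trade-off of this rule: outside ASCII, only whitespace is excluded, so non-ASCII punctuation is accepted in names.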
Note I think similar rules need to be applied to attribute names. They are a bit more complex because say Attributes are also prohibited from starting with an ASCII number (they’re currently only accepting ASCII too). I wonder if that’s needed.
Initial checklist
Problem
I write text files using an extended markdown syntax with a flavour for specific needs. Those text files are not in Latin script. I want to keep them in a uniform language without formatting prompts in English.
Markdown in general appears to have a language-independent syntax. ASCII-limited directives bring language-dependence.
Specific example
I am a Ukrainian speaker, creating a project for the local community with no internationalisation need in the future. I want to keep files in my native language as much as possible and have syntax as simple as possible.
My text files are songs. Sometimes, they contain a chorus that repeats after each verse (paragraph). Take a timely example:
My custom script detects the chorus and repeats it after each paragraph. However, chorus in Ukrainian is приспів and I would love to keep that native word in a Ukrainian text.
Solution
Configurable naming limitations.
Alternatives