UnicodeSyntax support #93

Open
maralorn opened this issue Nov 19, 2022 · 10 comments · Fixed by #95

Comments

@maralorn

maralorn commented Nov 19, 2022

I may be holding it wrong, but at least some unicode symbols are not supported as syntax:

e.g.:

processStateUpdater ∷
  ∀ a m.
  (NOMInput a, UpdateMonad m) ⇒
  Config →
  a →
  StateT (ProcessState a) m ([NOMError], ByteString)

gives me

(haskell [0, 0] - [5, 52]
  (top_splice [0, 0] - [5, 52]
    (exp_infix [0, 0] - [5, 52]
      (exp_apply [0, 0] - [1, 9]
        (exp_name [0, 0] - [0, 19]
          (variable [0, 0] - [0, 19]))
        (ERROR [0, 20] - [1, 5]
          (ERROR [0, 20] - [0, 23]))
        (exp_name [1, 6] - [1, 7]
          (variable [1, 6] - [1, 7]))
        (exp_name [1, 8] - [1, 9]
          (variable [1, 8] - [1, 9])))
      (operator [1, 9] - [1, 10])
      (exp_apply [2, 2] - [5, 52]
        (exp_tuple [2, 2] - [2, 29]
          (exp_apply [2, 3] - [2, 13]
            (exp_name [2, 3] - [2, 11]
              (constructor [2, 3] - [2, 11]))
            (exp_name [2, 12] - [2, 13]
              (variable [2, 12] - [2, 13])))
          (comma [2, 13] - [2, 14])
          (exp_apply [2, 15] - [2, 28]
            (exp_name [2, 15] - [2, 26]
              (constructor [2, 15] - [2, 26]))
            (exp_name [2, 27] - [2, 28]
              (variable [2, 27] - [2, 28]))))
        (ERROR [2, 30] - [2, 33]
          (ERROR [2, 30] - [2, 33]))
        (exp_name [3, 2] - [3, 8]
          (constructor [3, 2] - [3, 8]))
        (ERROR [3, 9] - [3, 12]
          (ERROR [3, 9] - [3, 12]))
        (exp_name [4, 2] - [4, 3]
          (variable [4, 2] - [4, 3]))
        (ERROR [4, 4] - [4, 7]
          (ERROR [4, 4] - [4, 7]))
        (exp_name [5, 2] - [5, 8]
          (constructor [5, 2] - [5, 8]))
        (exp_parens [5, 9] - [5, 25]
          (exp_apply [5, 10] - [5, 24]
            (exp_name [5, 10] - [5, 22]
              (constructor [5, 10] - [5, 22]))
            (exp_name [5, 23] - [5, 24]
              (variable [5, 23] - [5, 24]))))
        (exp_name [5, 26] - [5, 27]
          (variable [5, 26] - [5, 27]))
        (exp_tuple [5, 28] - [5, 52]
          (exp_list [5, 29] - [5, 39]
            (exp_name [5, 30] - [5, 38]
              (constructor [5, 30] - [5, 38])))
          (comma [5, 39] - [5, 40])
          (exp_name [5, 41] - [5, 51]
            (constructor [5, 41] - [5, 51])))))))
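
A note on reading the tree: as far as I understand, tree-sitter reports columns in bytes, not characters, which would explain why each ERROR node that covers a single symbol is three columns wide (for example [0, 20] - [0, 23] for "∷"). A minimal sketch in C to check that arithmetic; `utf8_len` is a hypothetical helper written for this comment, not part of the grammar:

```c
#include <stdint.h>

/* Number of bytes UTF-8 needs to encode one code point.
 * Returns 0 for values that are not encodable (surrogates, out of range). */
static int utf8_len(uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogates */
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

U+2237 (∷), U+2200 (∀), U+21D2 (⇒) and U+2192 (→) all fall in the three-byte range, matching the spans above.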
@tek
Contributor

tek commented Nov 19, 2022

right! I added some basics now, but there are some more missing.

@maralorn
Author

Thank you for the quick reaction. Yeah, those are probably the most important, nice.

Here is the list of all symbols, not so many are missing:

https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/unicode_syntax.html
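
For reference, here is my reading of that table as a small C lookup, in case it is useful for a scanner-side check. This is only a sketch: the struct and function names are made up for illustration, the list is an excerpt from the page above (the LinearTypes arrow is left out), and it says nothing about how the grammar actually handles these symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* One UnicodeSyntax symbol and its ASCII spelling (names are hypothetical). */
typedef struct {
    uint32_t code_point;
    const char *ascii;
} UnicodeSyntaxEntry;

/* Excerpt of the table in the GHC user's guide section on UnicodeSyntax. */
static const UnicodeSyntaxEntry unicode_syntax[] = {
    {0x2237, "::"},     /* ∷ */
    {0x21D2, "=>"},     /* ⇒ */
    {0x2192, "->"},     /* → */
    {0x2190, "<-"},     /* ← */
    {0x2919, "-<"},     /* ⤙ arrow notation */
    {0x291A, ">-"},     /* ⤚ arrow notation */
    {0x291B, "-<<"},    /* ⤛ arrow notation */
    {0x291C, ">>-"},    /* ⤜ arrow notation */
    {0x2605, "*"},      /* ★ */
    {0x2200, "forall"}, /* ∀ */
    {0x2987, "(|"},     /* ⦇ banana bracket */
    {0x2988, "|)"},     /* ⦈ banana bracket */
    {0x27E6, "[|"},     /* ⟦ quasi-quote bracket */
    {0x27E7, "|]"},     /* ⟧ quasi-quote bracket */
};

/* Returns the ASCII spelling for a UnicodeSyntax symbol, or NULL if the
 * code point is not in the excerpt above. */
static const char *ascii_equivalent(uint32_t cp) {
    size_t n = sizeof unicode_syntax / sizeof unicode_syntax[0];
    for (size_t i = 0; i < n; i++) {
        if (unicode_syntax[i].code_point == cp) {
            return unicode_syntax[i].ascii;
        }
    }
    return NULL;
}
```

A linear search is enough here because the table has only a handful of entries.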

@tek
Contributor

tek commented Nov 19, 2022

yep, already have that tab open 😉

@timtro

timtro commented Dec 22, 2022

Thanks in advance. I wish I could help solve and not merely report the issue. But I'm getting errors when I simply use unicode characters in/as identifiers.

Possibly helpful links:

Example below, and please, don't judge me for the quality of this code. It's my first Haskell program, and it's fit for a very specific purpose which is not production. (It mirrors a theoretical construction in my PhD thesis in systems theory.)

{-# LANGUAGE InstanceSigs  #-}
{-# LANGUAGE UnicodeSyntax #-}

module PrAlgebra where

import           Data.Fix (Fix (Fix), foldFix, unFix)

(▽) :: (a → c) → (b → c) → Either a b → c
(▽) = either

(△) :: (b → c) → (b → c') → b → (c, c')
(△) f g x = (f x, g x)

newtype 𝘗ᵣ hd tl = Pᵣ (Maybe (tl, hd))

instance Functor (𝘗ᵣ hd) where
  fmap :: (a → b) → 𝘗ᵣ hd a → 𝘗ᵣ hd b
  fmap f (Pᵣ Nothing)         = Pᵣ Nothing
  fmap f (Pᵣ (Just (tl, hd))) = Pᵣ (Just (f tl, hd))

type 𝘗ᵣAlgebra state value =  𝘗ᵣ value state → state

type Snoc hd = Fix(𝘗ᵣ hd)

snoc :: Snoc a → a → Snoc a
snoc xs x = Fix (Pᵣ (Just (xs, x)))

In terms of syntax highlighting, everything is coloured as a type. Here is a screenshot where a constructor is being called a type.
[Screenshot from 2022-12-22 09-50-47]

The tree is listed below.

pragma [0, 0] - [0, 30]
pragma [1, 0] - [1, 30]
module: module [3, 7] - [3, 16]
where [3, 17] - [3, 22]
ERROR [5, 0] - [25, 37]
  import [5, 0] - [5, 53]
    qualified_module [5, 17] - [5, 25]
      module [5, 17] - [5, 21]
      module [5, 22] - [5, 25]
    import_list [5, 26] - [5, 53]
      import_item [5, 27] - [5, 36]
        type [5, 27] - [5, 30]
        import_con_names [5, 31] - [5, 36]
          constructor [5, 32] - [5, 35]
      comma [5, 36] - [5, 37]
      import_item [5, 38] - [5, 45]
        variable [5, 38] - [5, 45]
      comma [5, 45] - [5, 46]
      import_item [5, 47] - [5, 52]
        variable [5, 47] - [5, 52]
  pat_literal [7, 0] - [7, 5]
    con_unit [7, 0] - [7, 5]
      ERROR [7, 1] - [7, 4]
        ERROR [7, 1] - [7, 4]
  type_parens [7, 9] - [7, 18]
    fun [7, 10] - [7, 17]
      type_name [7, 10] - [7, 11]
        type_variable [7, 10] - [7, 11]
      type_name [7, 16] - [7, 17]
        type_variable [7, 16] - [7, 17]
  type_parens [7, 23] - [7, 32]
    fun [7, 24] - [7, 31]
      type_name [7, 24] - [7, 25]
        type_variable [7, 24] - [7, 25]
      type_name [7, 30] - [7, 31]
        type_variable [7, 30] - [7, 31]
  type_apply [7, 37] - [7, 47]
    type_name [7, 37] - [7, 43]
      type [7, 37] - [7, 43]
    type_name [7, 44] - [7, 45]
      type_variable [7, 44] - [7, 45]
    type_name [7, 46] - [7, 47]
      type_variable [7, 46] - [7, 47]
  constraint [7, 52] - [25, 37]
    class: class_name [7, 52] - [7, 53]
      type_variable [7, 52] - [7, 53]
    type_literal [8, 0] - [8, 5]
      con_unit [8, 0] - [8, 5]
        ERROR [8, 1] - [8, 4]
          ERROR [8, 1] - [8, 4]
    ERROR [8, 6] - [8, 7]
    type_name [8, 8] - [8, 14]
      type_variable [8, 8] - [8, 14]
    type_literal [10, 0] - [10, 5]
      con_unit [10, 0] - [10, 5]
        ERROR [10, 1] - [10, 4]
          ERROR [10, 1] - [10, 4]
    ERROR [10, 6] - [10, 8]
    type_parens [10, 9] - [10, 18]
      fun [10, 10] - [10, 17]
        type_name [10, 10] - [10, 11]
          type_variable [10, 10] - [10, 11]
        type_name [10, 16] - [10, 17]
          type_variable [10, 16] - [10, 17]
    ERROR [10, 19] - [10, 22]
    type_parens [10, 23] - [10, 33]
      fun [10, 24] - [10, 32]
        type_name [10, 24] - [10, 25]
          type_variable [10, 24] - [10, 25]
        type_name [10, 30] - [10, 32]
          type_variable [10, 30] - [10, 32]
    ERROR [10, 34] - [10, 37]
    type_name [10, 38] - [10, 39]
      type_variable [10, 38] - [10, 39]
    ERROR [10, 40] - [10, 43]
    type_tuple [10, 44] - [10, 51]
      type_name [10, 45] - [10, 46]
        type_variable [10, 45] - [10, 46]
      comma [10, 46] - [10, 47]
      type_name [10, 48] - [10, 50]
        type_variable [10, 48] - [10, 50]
    type_literal [11, 0] - [11, 5]
      con_unit [11, 0] - [11, 5]
        ERROR [11, 1] - [11, 4]
          ERROR [11, 1] - [11, 4]
    type_name [11, 6] - [11, 7]
      type_variable [11, 6] - [11, 7]
    type_name [11, 8] - [11, 9]
      type_variable [11, 8] - [11, 9]
    type_name [11, 10] - [11, 11]
      type_variable [11, 10] - [11, 11]
    ERROR [11, 12] - [11, 13]
    type_tuple [11, 14] - [11, 24]
      type_apply [11, 15] - [11, 18]
        type_name [11, 15] - [11, 16]
          type_variable [11, 15] - [11, 16]
        type_name [11, 17] - [11, 18]
          type_variable [11, 17] - [11, 18]
      comma [11, 18] - [11, 19]
      type_apply [11, 20] - [11, 23]
        type_name [11, 20] - [11, 21]
          type_variable [11, 20] - [11, 21]
        type_name [11, 22] - [11, 23]
          type_variable [11, 22] - [11, 23]
    type_name [13, 0] - [13, 7]
      type_variable [13, 0] - [13, 7]
    ERROR [13, 8] - [13, 15]
      ERROR [13, 8] - [13, 15]
    type_name [13, 16] - [13, 18]
      type_variable [13, 16] - [13, 18]
    type_name [13, 19] - [13, 21]
      type_variable [13, 19] - [13, 21]
    ERROR [13, 22] - [13, 23]
    type_name [13, 24] - [13, 25]
      type [13, 24] - [13, 25]
    ERROR [13, 25] - [13, 28]
      ERROR [13, 25] - [13, 28]
    type_parens [13, 29] - [13, 45]
      type_apply [13, 30] - [13, 44]
        type_name [13, 30] - [13, 35]
          type [13, 30] - [13, 35]
        type_tuple [13, 36] - [13, 44]
          type_name [13, 37] - [13, 39]
            type_variable [13, 37] - [13, 39]
          comma [13, 39] - [13, 40]
          type_name [13, 41] - [13, 43]
            type_variable [13, 41] - [13, 43]
    type_name [15, 0] - [15, 8]
      type_variable [15, 0] - [15, 8]
    type_name [15, 9] - [15, 16]
      type [15, 9] - [15, 16]
    type_parens [15, 17] - [15, 29]
      ERROR [15, 18] - [15, 25]
        ERROR [15, 18] - [15, 25]
      type_name [15, 26] - [15, 28]
        type_variable [15, 26] - [15, 28]
    type_name [15, 30] - [15, 35]
      type_variable [15, 30] - [15, 35]
    type_name [16, 2] - [16, 6]
      type_variable [16, 2] - [16, 6]
    ERROR [16, 7] - [16, 9]
    type_parens [16, 10] - [16, 19]
      fun [16, 11] - [16, 18]
        type_name [16, 11] - [16, 12]
          type_variable [16, 11] - [16, 12]
        type_name [16, 17] - [16, 18]
          type_variable [16, 17] - [16, 18]
    ERROR [16, 20] - [16, 31]
      ERROR [16, 24] - [16, 31]
    type_name [16, 32] - [16, 34]
      type_variable [16, 32] - [16, 34]
    type_name [16, 35] - [16, 36]
      type_variable [16, 35] - [16, 36]
    ERROR [16, 37] - [16, 48]
      ERROR [16, 41] - [16, 48]
    type_name [16, 49] - [16, 51]
      type_variable [16, 49] - [16, 51]
    type_name [16, 52] - [16, 53]
      type_variable [16, 52] - [16, 53]
    type_name [17, 2] - [17, 6]
      type_variable [17, 2] - [17, 6]
    type_name [17, 7] - [17, 8]
      type_variable [17, 7] - [17, 8]
    type_parens [17, 9] - [17, 23]
      type_apply [17, 10] - [17, 22]
        type_name [17, 10] - [17, 11]
          type [17, 10] - [17, 11]
        ERROR [17, 11] - [17, 14]
          ERROR [17, 11] - [17, 14]
        type_name [17, 15] - [17, 22]
          type [17, 15] - [17, 22]
    ERROR [17, 32] - [17, 33]
    type_name [17, 34] - [17, 35]
      type [17, 34] - [17, 35]
    ERROR [17, 35] - [17, 38]
      ERROR [17, 35] - [17, 38]
    type_name [17, 39] - [17, 46]
      type [17, 39] - [17, 46]
    type_name [18, 2] - [18, 6]
      type_variable [18, 2] - [18, 6]
    type_name [18, 7] - [18, 8]
      type_variable [18, 7] - [18, 8]
    type_parens [18, 9] - [18, 31]
      type_apply [18, 10] - [18, 30]
        type_name [18, 10] - [18, 11]
          type [18, 10] - [18, 11]
        ERROR [18, 11] - [18, 14]
          ERROR [18, 11] - [18, 14]
        type_parens [18, 15] - [18, 30]
          type_apply [18, 16] - [18, 29]
            type_name [18, 16] - [18, 20]
              type [18, 16] - [18, 20]
            type_tuple [18, 21] - [18, 29]
              type_name [18, 22] - [18, 24]
                type_variable [18, 22] - [18, 24]
              comma [18, 24] - [18, 25]
              type_name [18, 26] - [18, 28]
                type_variable [18, 26] - [18, 28]
    ERROR [18, 32] - [18, 33]
    type_name [18, 34] - [18, 35]
      type [18, 34] - [18, 35]
    ERROR [18, 35] - [18, 38]
      ERROR [18, 35] - [18, 38]
    type_parens [18, 39] - [18, 56]
      type_apply [18, 40] - [18, 55]
        type_name [18, 40] - [18, 44]
          type [18, 40] - [18, 44]
        type_tuple [18, 45] - [18, 55]
          type_apply [18, 46] - [18, 50]
            type_name [18, 46] - [18, 47]
              type_variable [18, 46] - [18, 47]
            type_name [18, 48] - [18, 50]
              type_variable [18, 48] - [18, 50]
          comma [18, 50] - [18, 51]
          type_name [18, 52] - [18, 54]
            type_variable [18, 52] - [18, 54]
    type_name [20, 0] - [20, 4]
      type_variable [20, 0] - [20, 4]
    ERROR [20, 5] - [20, 12]
      ERROR [20, 5] - [20, 12]
    type_name [20, 12] - [20, 19]
      type [20, 12] - [20, 19]
    type_name [20, 20] - [20, 25]
      type_variable [20, 20] - [20, 25]
    type_name [20, 26] - [20, 31]
      type_variable [20, 26] - [20, 31]
    ERROR [20, 32] - [20, 42]
      ERROR [20, 35] - [20, 42]
    type_name [20, 43] - [20, 48]
      type_variable [20, 43] - [20, 48]
    type_name [20, 49] - [20, 54]
      type_variable [20, 49] - [20, 54]
    ERROR [20, 55] - [20, 58]
    type_name [20, 59] - [20, 64]
      type_variable [20, 59] - [20, 64]
    type_name [22, 0] - [22, 4]
      type_variable [22, 0] - [22, 4]
    type_name [22, 5] - [22, 9]
      type [22, 5] - [22, 9]
    type_name [22, 10] - [22, 12]
      type_variable [22, 10] - [22, 12]
    ERROR [22, 13] - [22, 14]
    type_name [22, 15] - [22, 18]
      type [22, 15] - [22, 18]
    type_parens [22, 18] - [22, 30]
      ERROR [22, 19] - [22, 26]
        ERROR [22, 19] - [22, 26]
      type_name [22, 27] - [22, 29]
        type_variable [22, 27] - [22, 29]
    type_name [24, 0] - [24, 4]
      type_variable [24, 0] - [24, 4]
    ERROR [24, 5] - [24, 7]
    type_name [24, 8] - [24, 12]
      type [24, 8] - [24, 12]
    type_name [24, 13] - [24, 14]
      type_variable [24, 13] - [24, 14]
    ERROR [24, 15] - [24, 18]
    type_name [24, 19] - [24, 20]
      type_variable [24, 19] - [24, 20]
    ERROR [24, 21] - [24, 24]
    type_name [24, 25] - [24, 29]
      type [24, 25] - [24, 29]
    type_name [24, 30] - [24, 31]
      type_variable [24, 30] - [24, 31]
    type_name [25, 0] - [25, 4]
      type_variable [25, 0] - [25, 4]
    type_name [25, 5] - [25, 7]
      type_variable [25, 5] - [25, 7]
    type_name [25, 8] - [25, 9]
      type_variable [25, 8] - [25, 9]
    ERROR [25, 10] - [25, 11]
    type_name [25, 12] - [25, 15]
      type [25, 12] - [25, 15]
    type_parens [25, 16] - [25, 37]
      type_apply [25, 17] - [25, 36]
        type_name [25, 17] - [25, 18]
          type [25, 17] - [25, 18]
        ERROR [25, 18] - [25, 21]
          ERROR [25, 18] - [25, 21]
        type_parens [25, 22] - [25, 36]
          type_apply [25, 23] - [25, 35]
            type_name [25, 23] - [25, 27]
              type [25, 23] - [25, 27]
            type_tuple [25, 28] - [25, 35]
              type_name [25, 29] - [25, 31]
                type_variable [25, 29] - [25, 31]
              comma [25, 31] - [25, 32]
              type_name [25, 33] - [25, 34]
                type_variable [25, 33] - [25, 34]

@tek
Contributor

tek commented Feb 8, 2023

I added three more symbols for built-in syntax.

I also took a look at the symbolic operator situation, and it's a little bit more difficult.
Legal characters for these varsyms are determined by membership in unicode categories, which contain about 6000 code points in noncontiguous intervals.

We are parsing varsyms in the scanner, which means we don't have access to the unicode category regex classes that are provided by tree-sitter.
I couldn't find a method to do this in standard C, but maybe someone knows better?
For what it's worth, I tried adding a switch with 6k cases and performance only degraded by about 1%.
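
For comparison, one alternative to a giant switch would be a sorted table of code point ranges with a binary search. The sketch below is only an illustration of that idea: the three ranges are a tiny excerpt (whole Unicode blocks that, to my knowledge, contain only symbol characters), the names are hypothetical, and a real table would have to be generated from the Unicode data for the symbol and punctuation categories.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points (hypothetical type for this sketch). */
typedef struct {
    int32_t first;
    int32_t last;
} CodePointRange;

/* Illustrative excerpt only. Must stay sorted and non-overlapping. */
static const CodePointRange symbol_ranges[] = {
    {0x2190, 0x21FF}, /* Arrows */
    {0x2200, 0x22FF}, /* Mathematical Operators */
    {0x25A0, 0x25FF}, /* Geometric Shapes, includes △ (U+25B3) and ▽ (U+25BD) */
};

/* Binary search over the range table: true if cp lies in any listed range. */
static bool is_unicode_symbol(int32_t cp) {
    size_t lo = 0;
    size_t hi = sizeof symbol_ranges / sizeof symbol_ranges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < symbol_ranges[mid].first) {
            hi = mid;
        } else if (cp > symbol_ranges[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}
```

Whether this beats the switch in practice is an open question; the compiler may already turn the switch into something similar.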

@maralorn
Author

maralorn commented Feb 8, 2023

I am not sure what the rules are here, but would it be terrible to over-approximate? (I also don’t know whether it would simplify things.) I would assume that allowing a larger class of Unicode symbols, one that is maybe easier to check, would be unlikely to misparse valid Haskell?

@tek
Contributor

tek commented Feb 8, 2023

possibly, but I'm absolutely uncertain. 6k code points in a range of 130k seems quite disproportionate, and they are spaced out pretty wide.
We could try > N for some value and test all smaller ones explicitly.
But since performance doesn't take a significant hit, we could also just put the 6k cases in a separate file in a switch and be done with it 🙃
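
To make the "> N" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the cutoff value is arbitrary, the function names are made up, and the explicit part below the cutoff is only a stand-in (the ASCII symbol characters from the Haskell report plus a few Unicode blocks) for the full generated list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cutoff; no value for N has been chosen in this thread. */
#define EXPLICIT_LIMIT 0x3000

/* ASCII symbol characters allowed in operators (ascSymbol in the report). */
static bool is_ascii_symbol(int32_t cp) {
    switch (cp) {
        case '!': case '#': case '$': case '%': case '&': case '*': case '+':
        case '.': case '/': case '<': case '=': case '>': case '?': case '@':
        case '\\': case '^': case '|': case '-': case '~': case ':':
            return true;
        default:
            return false;
    }
}

/* Stand-in for the explicit cases below the cutoff (excerpt only). */
static bool listed_below_limit(int32_t cp) {
    return is_ascii_symbol(cp)
        || (cp >= 0x2190 && cp <= 0x22FF)  /* Arrows, Mathematical Operators */
        || (cp >= 0x25A0 && cp <= 0x25FF); /* Geometric Shapes */
}

/* Over-approximating check: exact below the cutoff, accept everything above. */
static bool maybe_symbol_char(int32_t cp) {
    if (cp > EXPLICIT_LIMIT) {
        return true; /* not checked at all, may be a letter */
    }
    return listed_below_limit(cp);
}
```

The downside shows up with the example from this issue: 𝘗 (U+1D617) is a letter, yet anything above the cutoff is accepted as a symbol character, so identifiers like that would need to be ruled out some other way.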

@maralorn
Author

maralorn commented Feb 8, 2023

Your call. I would also wonder a bit how much bigger the grammar would become …

@tek
Contributor

tek commented Feb 8, 2023

the haskell.so grows by 10kB. (total 3.6MB)

@tek
Contributor

tek commented Feb 9, 2023

the arrow notation operators appear not to be within the categories used for the PR we just merged. also unsure about those banana brackets, they would probably need special treatment.
