Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper sum types and structural pattern matching #525

Open
omentic opened this issue May 21, 2023 · 64 comments
Open

Proper sum types and structural pattern matching #525

omentic opened this issue May 21, 2023 · 64 comments

Comments

@omentic
Copy link

omentic commented May 21, 2023

Abstract

Proper sum types and structural pattern matching for Nim by the introduction of union and extension of case with structural matching and and where clauses. Heavily inspired by Rust, Swift, Racket, Python, and others, and adapted to Nim's particular type system and syntax.

Motivation

I write a lot of Nim code, and I write a lot of Rust code. I find both to be excellent languages with their own unique tradeoffs and specialties. But there is one thing in particular I greatly miss when going from Rust to Nim: and that is algebraic data types and pattern matching.

Algebraic data types (ADTs), for those unfamiliar, are an extremely powerful data structure that are useful for modelling a wide variety of types. They allow, by the composition of product types and sum types (as well as ordinary types), the modelling a wide variety of different data structures.

Nim does have algebraic data types - objects may be composed arbitrarily. Their application to product types is straightforward: product types are directly analogous to structs (or in Nim, standard objects). Sum types complement product types and can be thought of as their dual: they act as a wrapper around different types, letting you interact with their inner structure only by safely pattern matching upon the container. They are best represented by tagged unions, however, which Nim lacks a direct implementation of.

Variant objects are capable of modelling tagged unions and thus sum types. These are useful as, well, variants of objects, but have a number of serious drawbacks when used as sum types that the first portion of this RFC aims to address.

  • The names of fields in variant objects must be unique - they cannot be repeated, even across different variants.
  • A separate enum must function as a "kind" tag and be discriminated on by case within the object. Having a separate single-use enum is undesirable, and having to discriminate against it is awkward.
  • Being forced into an object structure is awkward: the "type wrapper" approach of tagged unions is more ergonomic, especially when there are no shared fields, as is often the case.

As for pattern matching: Nim has the case statement, which alone is sufficient to make use of the inner contents of an sum type (or currently, switch on a kind label). But structural pattern matching is the bread and butter of ADTs: aside from simply being cleaner than nested if statements, it provides guarantees of exhaustiveness not otherwise given by the alternative. It is perhaps best exemplified by the ML family of languages (including Rust): but other languages have picked it up offering their own takes on it, notably Racket, Swift, and Haxe. This RFC pulls from a number of different designs to provide a powerful and Nim-esque model.

Description

This RFC is two-fold. It introduces a new fundamental type, union, with similar semantics to Rust's enum. It also extends Nim's existing case statement with support for structural pattern matching and particularly that of unions/objects/tuples (although it supports lists as well), and introduces a new supplemental where clause in order to cleanly and powerfully do so.

union is a type that functions similarly to enum at a surface level. It takes a list of identifiers as fields/variants/values, and an instance of a union may only ever be one of the fields. However, each field of the union may additionally be of its own entirely separate type (or none). This is best exemplified with the classic "shape" example:

type Shape = union
  Point
  Line: int
  Square: tuple[side: int]
  Rectangle: tuple[x, y: int]
  Triangle: TriangleKind
  # type TriangleKind = enum Scalene, Isosceles, Equilateral

These fields of the unions are unordered, unqualified (via the same mechanism as overloadable enums), and unique. The inner fields are each of distinct types: entries used solely for their label (ex. Point) are implicitly of void type. Instantiating a union, then, is done by picking a particular variant, and providing data for the construction of its associated type (or none) as needed. The syntax of this is again best shown with an example. As a special syntax rule, tuples and objects may drop their extraneous parentheses on construction.

var shape: Shape
shape = Point()
shape = Line(5)
shape = Square(side: 5)
shape = Rectangle(x: 5, y: 6)
shape = Triangle(Scalene)

To support proper and powerful pattern matching, we introduce three new extensions to the language: an extension for existing case/of statements to support (nested) structured types (array, seq, tuple, object, union), an extension for of expressions in regular conditionals to support the same, and a new where expression that functions as a reusable "guard". Just like union, these language features combine to provide power with minimal new syntax. Following existing case semantics, this pattern matching may be exhaustive (to be described later), or non-exhaustive and followed by elif and else statements.

Elements of a tuple may be matched in two ways: by providing a value, which will match all tuples with that value at that location, or by binding them to a new identifier, which will match all tuples with any value at that location, and will then be made available as a variable within the branch. New variables introduced by pattern matching in this fashion follow the mutability of the matched-upon statement they are a part of. As is the case elsewhere, binding an element to the _ identifier discards it.

where guards allow for restricting the matching of a case based on an arbitrary conditional expression. They have access to any bound fields: and so can be used to restrict variables to a value while still allowing access to that value among other, arbitrary conditions. where clauses follow the main match section of an of line and are considered to be attached to the main match rather than the of itself to allow for the reuse of labels to follow existing case semantics: see the example below. Of important note is the intersection between variable bindings and repeated cases: in order for a variable to be accessible within the body of the case, it must be bound (in a consistent location) in every label of that statement. We do not disallow this at a binding level because the bindings are still helpful for where clauses.

The exhaustiveness of a pattern match is determined on a best-effort attempt. Fields of booleans, enums, and unions consist of few possibilities and may be exhaustively matched upon. Guards do not contribute to the exhaustivity of a search as they may contain arbitrary conditional expressions: and so must typically be followed by a case for the same match without the guard, or else.

Named tuples and objects are matched identically to tuples, with one exception: the identifier name when binding a field must be the name of the original field. If this is undesired, the syntax oldident: _ as newident is introduced to allow for explicit rebinding. Unions are matched by specifying their label, and then their inner type in parenthesis. Tuples and objects within a union may elide their parenthesis.

func match(shape: Shape) =
  case shape
  of Point():
    ...
  of Line(_ as x) where x in 0..4:
    ...
  of Line(_ as x):
    ...
  # matching an existing variable
  of existingShape:
    ...
  # matching an existing variable, convolutedly
  of _ as x where x == existingShape:
    ...
  # tuples within a union elide their external parenthesis.
  of Square(side: _ as x):
    assert declared(x)
    assert not declared(side)
  # Rectangle(x, y) is syntax sugar for Rectangle(x: _ as x, y: _ as y)
  # matching multiple expressions follows case semantics
  of Rectangle(x, y) where x == y, Rectangle(0, _), Rectangle(_, 0):
    # x is not bound in every case! so it is not accessible
    ...
  else:
    ...

Previous thoughts on matching sequences, arrays, and sets are collapsed in the below block. I no longer think they're terribly important. They are also not updated for the new backwards-compatible _ as x syntax.

Matching sequences, arrays, setsMatching sequences and arrays is done slightly differently. To distinguish them from tuple and object matching, brackets are used instead of parenthesis. These are not elided within a union type as the parentheses of tuples and objects are. Matching of elements is done similarly to tuples: individual elements of the sequence/array are matched positionally, the variable/value matching distinction remains the same, and where guards are again available for conditional treatment of bindings.

As sequences may be of unknown length, and arrays may have uncomfortably long lengths, the .. operator is introduced/reused to denote a range of fields we do not care about matching, with the syntax [first, second, .. last]. As a result, the position of fields may be specified either from the beginning or from the end of the list - it is not possible to match a field an arbitrary distance into a long list within the match pattern itself. Again, however, where guards provide a more generic way to do so.

import std/[sequtils, sugar]

func lists[T](list: openarray[T]) =
  case list
  of [1, 3, 5]:
    ...
  of [first, second, third .. penultimate, last]:
    ...
  of x where x.len > 10 and x[10] == true:
    ...

  # lists may be arbitrarily nested, just like every structured type
  of [[1, 2, 3], [4, 5, 6], [7, 8, 9]]:
    ...

  # match an arbitrary list of integers
  of x where typeof(x) is openarray[int]:
    ...
  # match a list of entirely 1s:
  of x where x.all(y => (y == 1)):
    ...

  # the following three statements match identically:
  of [_, _, _]:
    ...
  of x where x.len == 3:
    ...
  elif list.len == 3:
    ...
  else:
    ...

Matching sets and tables mostly follows list semantics. Both are wrapped in braces instead of brackets or parentheses. As the elements of sets and tables are never named: the : does not function as it does in tuples and objects, instead, it takes its regular semantics within tables to denote a key-value pair. Of additional note is the .. operator: as sets and tables can be of unknown length it is helpful: however, as they are unordered, the .. operator must come last.

func sets(bitset: set[int8]) =
  case bitset
  # match the set containing exclusively 1
  of {1}:
    ...
  # match a set containing at least 1
  of {1, ..}:
    ...
  # match a set containing only two distinct elements
  of {x, y}:
    ...
  # match a set containing any two distinct elements fulfilling a condition
  of {x, y, ..} where x + y == 10:
    ...
  else:
    ...

func tables(table: Table[int, string]) =
  case table
  # match a table containing only these key-value pairs
  of {5: "foo", 6: "bar"}:
    ...
  # match a table containing at least these key-value pairs
  of {5: "foo", 6: "bar", ..}:
    ...
  # match a table containing any key matching a condition
  of {x: y, ..} where x == y.len:
    ...

While the exhaustivity of pattern matching is important: frequently you want to disregard it, and only worry about a single case. Rust provides the if let construct for this use case, which I find very useful but pretty gross syntactically. Instead, this RFC uses the syntax if shape of Square(side: binding). The bindings become immediately available for use within the body of the if statement and within any parts of the if statement following that clause, as shown by the following examples:

shape = Rectangle(x: 5, y: 6)

if shape of Rectangle(x, y) and x == 5:
  echo "y is " & $y
else:
  echo "failure."

Following the semantics above, these extended case and if ident of match statements should also allow unwrapping Option[T] types. What this looks like is not yet specified.

Code Examples

Algebraic Data Types

An example that perhaps more showcases the utility of ADTs is that of an abstract syntax tree for the simple lambda calculus.

type Ident = string

type Expr = ref union
  Constant: Term
  Variable: tuple[id: Ident]
  # Annotation: tuple[expr: Expr, kind: Type]
  Abstraction: tuple[param: Ident, fn: Expr]
  Application: tuple[fn, arg: Expr]
  Conditional: tuple[cond, `if`, `else`: Expr]

Note that other structures needed for an implementation of the simple lambda calculus (including a type system) are also efficiently and aesthetically modelled with ADTs.

type Type = union
  Empty, Error, Unit, Boolean, Natural, Integer, Float, String
  Enum: seq[Type]
  Struct: Table[Ident, Type]
  Function: tuple[from, to: ref Type]

type Term = union
  Unit
  Boolean: bool
  Natural: uint
  Integer: int
  Float: float
  String: tuple[len: uint, cap: uint, data: seq[uint]]
  Enum: tuple[val: uint, data: seq[Type]]
  Struct: Table[Ident, Term]
  Function: Expr
Pattern Matching
var globalSymbolTable = initTable[Ident, Term]()

func execute(ast: Expr): Term =
  case ast
  of Constant(_ as term):
    return term
  of Variable(id):
    return globalSymbolTable[id]
  # other structures of applications + abstractions are invalid
  of Application(Abstraction(param, fn), arg):
    globalSymbolTable[param] = execute(arg)
    return execute(fn)
  of Conditional(cond, `if`: _ as ifClause, `else`: _ as elseClause):
    if execute(cond) == Boolean(true):
      return execute(ifClause)
    else:
      return execute(elseClause)
  else:
    echo "Failed to execute AST! Improper structure"
    quit()

Backwards Compatibility

This RFC is almost fully backwards compatible with Nim. It will cause issues with anything using the new union or where keywords, and with any templates or macros directly inspecting case or of.

A previous version of this RFC had an issue around pattern matching: the previous syntax was of x, where x is a newly bound identifier. This conflicted with the existing semantics to resolve constants in those expressions. This was fixed by changing pattern matching to require explicitly rebinding the catchall character _, i.e. _ as name, with syntax sugar in selected cases (named tuples & structs).

Notes

To end with some notes: the general design of where guards is to provide all of the parts of pattern matching that cannot be exhaustive, in one general construct. Hence the use of where for things regarding ranges, lengths, existing variables, generic types (although: type constraints can be exhaustive. this is specific enough that i thought it not worth special syntax). where was also chosen over reusing if because I wanted to emphasize that there is not a corresponding elif/else as it is, already, what amounts to an elif clause within the broader case statement.

I think I like oldident: newident for field renaming. It matches object constructors. oldident as newident is an alternative, however.

I only mentioned it briefly, but because I've seen it discussed in other implementations of pattern matching in Nim: the mutability of the introduced bindings while pattern matching follows from the mutability of the matched-upon object, that is, you're directly mutating the matched-upon object. Other languages follow this pattern and I see no reason why this + variable shadowing if you really want to make it mutable are good enough.

I'm considering allowing guards to be reusable on for blocks a la Swift: with for binding in iterator where condition acting as syntax sugar to the common for binding in iterator: if condition. Is this desired? Is this good or bad for language consistency? Are there other places where clauses could be used?

The else statement and case _ are functionally the same. I believe this is currently the case.

References and Acknowledgements

Many thanks to fungus: which implements a near-identical union and quite similar match already as macros.
Many thanks to patty: particularly for their "does not yet work" section, ha.
Many thanks to Beef, for having good taste and calling my stupid ideas stupid.

@levovix0
Copy link

Is it really necessary to change the language syntax for this? Maybe it would be better to standardize macros that implement the same thing on the existing syntax?

The current variant objects are really quite unconvinient, and it would be nice to have syntactic sugar over them to create algebraic types.

@juancarlospaco
Copy link
Contributor

  • Python: Success, used by people, part of StdLib (very recent on very old lang).
  • Racket: Success, used by people, part of StdLib.
  • Swift: Success, used by people, part of StdLib.
  • Grain: Success, used by people, part of StdLib.
  • Rust: Success, used by people, part of StdLib.
  • Nim: Fail, not widely used by people, not part of StdLib.

🤔

@Araq
Copy link
Member

Araq commented May 21, 2023

That doesn't mean much beyond the fact that people avoid dependencies for various reasons.

@Araq
Copy link
Member

Araq commented May 21, 2023

So what does this RFC accomplish that e.g. fusion/matching does not?

You don't need to teach me about ML and Rust and Racket and Foozbaz, I know. You need to tell me why you found the existing solutions based on macros lacking.

@juancarlospaco
Copy link
Contributor

fact that people avoid dependencies

We agree.

@omentic
Copy link
Author

omentic commented May 21, 2023

So what does this RFC accomplish that e.g. fusion/matching does not?

The union type is new. Something similar exists as a macro in beef/fungus, but the implementation differs from how it would be implemented in the compiler for performance and semantic reasons.

fusion/matching also provides comprehensive structural pattern matching, there is no feature discrepancy between the two. This RFC is an attempt to integrate it into the language more: by means of reusing existing case syntax and pulling things that would require specific new syntax and not help with exhaustiveness into where clauses (this is described in more detail in the first paragraph of the notes section above).

After work, I'll write up a detailed comparison between the two because I think it reveals some interesting design decisions. But a place to look for now, if interested, is the difference between this RFC's use of where guards vs. fusion/matching's use of until, all, some, @. Revisiting fusion also reminds me I forgot to describe how tables are matched. It's exactly as you would expect and consistent with tuples etc, I'll add this later too.

@omentic
Copy link
Author

omentic commented May 21, 2023

Also, a little note on other languages: despite Racket's match statement being the most powerful of the bunch and Python being the closest language to Nim syntactically, I took the least inspiration from them. I find Python's match to be grotesquely complex for relatively little benefit, and think where clauses a la Swift to be a much nicer solution. This proposal is also purposely less powerful than Racket: Racket's ability to match on repetitions of arbitrary nested list expressions requires special syntax and is only particularly helpful to a list-oriented language. Nonetheless, it can be done with where clauses if needed (this is a trend...).

@arnetheduck
Copy link

Very nice!

ADT's are to Nim case types what scalars are to vectors - you can express everything an int can express using a seq[int] but doing so is cumbersome in code and misses out on many of the assumptions you can make and therefore on the help that the language and compiler can bring.

It's true that the proposed type sort of is a special case of a case object with a specific form of enum, exactly one discriminator and no other members - but that special case is useful in many constructs which logically express "one and only one of these things" - the rest is followup on top of that concept.

Rather than asking "can macros solve this?", the better question is "can we create better macros with this feature present", and there the answer is undoubtedly yes, because it's such a fundamental construct that appears over and over in many code patterns.

Many things become more simple when you can these assumptions about a type: as mentioned already: resetting or overwriting during serialization, exhaustive case statements, etc.

Here's a simple example which relies on the knowledge that a type takes on exactly one shape: because the type is guaranteed to be exactly one of the enumerated options, we can reason about that option without specifying the full type but still express a meaningful relationship to it. In Result[T, E] this is often useful because you want to be able to express a type that matches any Result[T, X] where X is known but T is not to express the idea that you are working with "the error side of type X of any Result" without knowing the exact type of the result (when returning an error you don't care about the "value" side - this is supplied by the context in which the error is being returned, not the error return itself) - such a type can safely be converted to a "full" Result because it is known per language semantics that Result itself takes one and only one of the given options.

@jmgomez
Copy link

jmgomez commented May 22, 2023

I generally like the idea and I was one of the complainers about the PM situation. But as I said in Discord, if this ever gets accepted, who is going to take ownership and implement it. Especially PM?
I really hope that the roadmap is unaltered and IC still comes right after 2.0.

So what does this RFC accomplish that e.g. fusion/matching does not?

At the very least, it will be a workaround for this :P
nim-lang/Nim#20435

@konsumlamm
Copy link

How would the proposed syntax distinguish between a single field that is a tuple and multiple fields? Using tuples for the latter is not nice imo. I'd prefer something closer to the syntax for constructing/matching the variants (which is also how Rust and Swift do it):

type Shape = union
  Point()
  Line(int)
  Square(side: int)
  Rectangle(x, y: int)
  Triangle(TriangleKind)

@Araq
Copy link
Member

Araq commented May 23, 2023

In my unwritten proposal I used this syntax:

type
  Tree = ref case kind # also possible without the `ref`
    of StringLeaf:
      str: string
    of IntLeaf:
      x: int
    of Parent:
      kids: seq[Tree]

No new keyword required and the syntax is optimized for pasting into a case statement.

@omentic
Copy link
Author

omentic commented May 23, 2023

@konsumlamm Currently:

type Shape = union
  Rectangle: tuple[x, y: int]
type Shape = union
  Rectangle: tuple[tuple[x, y: int]]

I picked this syntax to be as close to Nim's existing semantics as possible while still providing the full power of ADTs. I like your syntax more aesthetically, but IMO it doesn't fit in with tuple / object types as well. Also note that the inner type of Rectangle in both of our examples is tuple[x, y: int]: even if it's not explicitly specified in yours.

@Araq I assume the unwritten proposal was mostly for syntactic sugar over the existing variant object syntax? A benefit of the union proposal and ADTs as a whole is that they let you abstract over arbitrary types, not just named fields of objects. (eg. you don't really need the labels str, x, kids in your Tree implementation, just the types)

@Araq
Copy link
Member

Araq commented May 23, 2023

eg. you don't really need the labels str, x, kids in your Tree implementation, just the types

That's a good point but easily fixed:

type
  Tree = ref case # also possible without the `ref`
    of StringLeaf:
      string
    of IntLeaf:
      int
    of Parent:
      seq[Tree]

The advantages remain: Natural Nim syntax, no new keyword, easily turned into a case statement.

@omentic
Copy link
Author

omentic commented May 23, 2023

I disagree that that's natural Nim syntax, I find it subjectively very ugly, the lack of a field following case inconsistent, and the raw types in branches weird. I don't think being easily turned into a case statement is a particular advantage, and it really doesn't follow type declaration syntax or semantics (I would be quite thrown off learning Nim for the first time).

I'm not particularly attached to the union keyword but I think introducing a new keyword to make it a first-class type on the level of object / tuple / enum would be worth the tradeoff from a language design perspective.

@omentic
Copy link
Author

omentic commented May 23, 2023

I suppose we could also do this leaving eg. nested unambiguously available for use elsewhere, though I don't like it.

type Tree = ref nested enum # oops need indirection
  StringLeaf: string
  IntLeaf: int
  Parent: seq[Tree]

@Araq
Copy link
Member

Araq commented May 23, 2023

I care about keeping the of branches, if you want to change case to sumtype or union I don't mind it. Though both sumtype and union are ugly. Maybe enum?

type
  Tree = ref enum # also possible without the `ref`
    of StringLeaf:
      string
    of IntLeaf:
      int
    of Parent:
      seq[Tree]

@Araq
Copy link
Member

Araq commented May 23, 2023

type Tree = nested enum
  StringLeaf: string
  IntLeaf: int
  Parent: seq[Tree]

is wrong for two reasons: You need the ref indirection in order to nest it. And StringLeaf: string is not a field of type string, it is a branch that supports an unnamed string field.

@omentic
Copy link
Author

omentic commented May 23, 2023

That kinda goes against the existing semantics of enum being explicitly an enumeration though, and also conflicts a bit semantically (not in syntax) with the existing feature that lets you set labels of an enum to values. My original approach was to overload the enum type but I found it conflicted too much with existing expectations of enum.

Now that I think about it more, I really don't like the idea that you can copy and paste the of branches into a case statement actually: most of the times when I use ADTs in other languages I don't want to handle all the cases, and/or want to handle particular cases differently, via structural pattern matching or guards or what not. The extended examples I picked for this RFC are a bit contrived in that they are for interpreting a language where you have to handle every case without restrictions. I think that'd kind of lead people in the wrong direction wrt. pattern matching.

re: ref, oops. ref indirection is actually supported and necessary in this RFC btw, there's some examples in the collapsed Code Examples block.

@omentic
Copy link
Author

omentic commented May 23, 2023

And StringLeaf: string is not a field of type string, it is a branch that supports an unnamed string field.

Yeah, even though they're comparable to objects and the syntax looks similar, they're not. This is one of the few parts that differs from existing semantics. I personally think it's fine - concepts do the same thing and introduce different-than-objects but still very consistent syntax within their type declaration.

@Araq
Copy link
Member

Araq commented May 23, 2023

I think that'd kind of lead people in the wrong direction wrt. pattern matching.

Actually, this is how you ensure correctness, having thought about it, you handle every case explicitly. It's the "oh and in all other cases (else) do ..." that produces bugs. In the Nim compiler and everywhere else.

@metagn
Copy link
Contributor

metagn commented May 23, 2023

Maybe case object?

@nc-x
Copy link

nc-x commented May 23, 2023

If a non-macro solution for ADTs/Pattern Matching is finally being considered, then my vote goes to a rust/haxe like syntax only. A lot of the syntaxes suggested above are confusing / error prone / ugly just like the existing object variant syntax (imo).

@deech
Copy link

deech commented May 23, 2023

ML style ADTs would be a great idea and I won't quibble about the syntax but I worry that too much conditional freedom on each match arm eg. x where typeof(x) is ... will cripple exhaustiveness checking which is biggest unseen benefit of ADT's. I would be happy with much less pattern matching expressivity in return for the compiler flagging missing patterns. IMO previous efforts like fusion/matching and patty do too much.

Have you looked D's sumtype? It is inspired by ADTs but leans heavily into static introspection and would be a nicer fit for Nim vs. trying to copy Haskell/OCaml.

@konsumlamm
Copy link

konsumlamm commented May 23, 2023

Have you looked D's sumtype? It is inspired by ADTs but leans heavily into static introspection and would be a nicer fit for Nim vs. trying to copy Haskell/OCaml.

D's sumtype doesn't support nested patterns, which is half the point of pattern matching and would make a lot of things more cumbersome, so I don't think that's an approach we should follow.

@deech
Copy link

deech commented May 23, 2023

D's sumtype doesn't support nested patterns, which is half the point of pattern matching and would make a lot of things more cumbersome.l, so I don't think that's an approach we should follow.

It's more cumbersome but when you can only destructure on one thing at a time static analysis is much easier, you can get better IDE support, possibly even exhaustiveness checking, and maybe even case splitting like Idris.

@Araq
Copy link
Member

Araq commented May 23, 2023

possibly even exhaustiveness checking

Nim's case statement already does exhaustiveness checking and I have no intention of moving to something that lacks it.

@deech
Copy link

deech commented May 23, 2023

Nim's case statement exhaustiveness checking is pretty brittle:

type
  AnEnum = enum
    A
    B

let e = A

case e
of A: echo "A"
else:
  case e
  of B: echo "B"

Here I get the error:
/tmp/testcase.nim(11, 3) Error: not all cases are covered; missing: {A}

@Araq
Copy link
Member

Araq commented May 23, 2023

It's not "brittle", it works as intended. I'm also not familiar with any mainstream language that offers exhaustiveness checking over nested case/match statements in the way that you outline.

@deech
Copy link

deech commented May 23, 2023

Not mainstream, and not completely generally but with a ton of effort Haskell does. But the point is by making the pattern matching more "cumbersome" but also more analyzable you can get this but without most of the implementation complexity.

@omentic
Copy link
Author

omentic commented May 23, 2023

@deech Ah, I'll write up a little thing on this with more detail. But basically: I'm very aware of this, and designed it so that the where statement is used for things that cannot be exhaustively checked.

@metagn I don't love case object - I think that gives the impression that it is fundamentally an object, which it isn't. Maybe case enum though? Though I do much prefer a new keyword.

@elcritch
Copy link

But union types are useless without some form of structural pattern matching: even if this isn't comprehensive to begin with, full pattern matching should probably be discussed here to avoid creating conflicts in the implementation in the future.

I'd generally agree. Without pattern matching sum types would only provide convenient constructors which macros can provide. I'd still think it'd be great to have but would feel lacking. Really the implementation could ship in multiple parts, but having even a basic pattern matching syntax agreed upon would be useful. Maybe skip where clause semantics for a part 2.

Importantly having pattern matching syntax in the language would prevent splintering and requiring folks to learn multiple different macro based versions. I already didn't want to mix Fungus and Patty in a project as they have different semantics.

Also I don't believe it's possible to do the if shape as Rectangle(x, y) without compiler changes IIRC. ElegantBeef and I were trying some hacks, but it was late night.

BTW, here's another possible syntax:

type
  Tree1 = ref enumeration
    of StringLeaf:
      str: string
    of IntLeaf:
      x: int
      y: int
    of Parent:
      kids: seq[Tree]

  Tree2 = ref enumeration
    of StringLeaf: string
    of IntLeaf: tuple[int, int]
    of Parent: seq[Tree]

The downside is that it's slightly confusing at first why there'd be an enum and enumeration that'd be easy to explain: simple enum, complex enumeration. Personally I like having both a simple enum and a complex enumeration.

@haxscramper
Copy link

@konsumlamm

I'd be wary of special pattern matching syntax for tables.

It should be for contains() and [] operator obviously, not for Table type.

Aka {6: x} is it.contains(6) and x = it[6]

@arnetheduck
Copy link

are useless without some form of structural

I think my point is more that pattern matching should take the whole language into consideration and not only union types (ditto union types themselves). One can easily imagine pattern-matching over other things than union types, including value matching / conditionals / etc and treating them here in the same RFC not only risks blocking union types on bikeshedding of pattern matching but indeed risks not giving due attention to each of them in the context of the rest of the language.

I don't disagree that it's good to keep both things in mind during initial design, but union types are useful on their own (because of the expressivity they add to the type system) and pattern matching isn't limited to them.

@Araq
Copy link
Member

Araq commented May 24, 2023

I think my point is more that pattern matching should take the whole language into consideration and not only union types (ditto union types themselves).

If only we had capable language designers who could have anticipated that... ;-)

@elcritch
Copy link

Good points. I wonder if there's a way to enable just a simple x as Some(y) type syntax using an overloaded operator? With that and case macros one could easily build most (all?) of the pattern matching bits as a library. If that's possible as an operator? Maybe a tuple override.

Then it'd just be settling on a union / sum type / enumeration / maybe-an-adt syntax. That should be easy. ;)

Then we could get sample PRs to get the ball rolling.

@haxscramper
Copy link

If only we had capable language designers who could have anticipated that...

That just allows you to write macro that is syntactically overlapping with case statement ... so basically the link should point to the macro section actually

@haxscramper
Copy link

@J-James I think aside from comparison with hmatching on features you can also ask for feedback on forum/discord like what sucks?, What you don't like, what could be added, what you didn't understand how to use, what feels off -- just some quick reactions

@omentic
Copy link
Author

omentic commented May 24, 2023

Yes, that kind of feedback is very welcome (along with all other feedback)! I've posted the link to this a few times but have not explicitly asked people to go through and evaluate it yet, will do so in the morning.

@metagn
Copy link
Contributor

metagn commented May 24, 2023

Does a language implementation of something like https://github.com/alaviss/union/ or like Crystal's runtime union types (but not implicit or merged with union typeclasses) solve the problems listed in the issue?

type
  A = object
    x, y: int
  Foo = union(int, A)

# under the hood an object variant, but structural and commutative
assert Foo is union(A, int)

proc `$`(f: Foo): string =
  if f of int: # reuse object inheritance check syntax
    result = $int(f)
  elif f of A:
    let f = A(f)
    result = "A(" & $f.x & ", " & $f.y & ")"
  # or as case statement
  case f
  of int: $f # automatic type narrowing? doesn't really make sense but not essential
  of A: "A(" & $f.x & ", " & $f.y & ")"
  # cases exhausted

var f: Foo
# A is convertible to Foo
f = A(x: 1, y: 2)
assert $f == "A(1, 2)"
# so is int
f = 456
assert $f == "456"

# recursion and distinct types
type
  StringLeaf = distinct string
  IntLeaf = distinct int
  ParentTree = distinct seq[Tree]
  Tree = ref union(StringLeaf, IntLeaf, ParentTree)
let tree = Tree(ParentTree("abc"))

(I expect recursion here to work because the following code works, but it might not be reliable)

type
  Bar = int | ref Foo
  Foo = float | Bar
assert int is Foo
assert (ref int) is Foo
assert (ref ref int) is Foo
assert float is Foo
assert (ref float) is Foo
assert (ref ref float) is Foo
assert string isnot Foo
assert (ref string) isnot Foo

This way each branch is tied to a type, construction and destructuring are much simpler to implement (and not ambiguous with existing case statements like every pattern matching solution described above), you don't need to worry about field names or whatever, you can even do things like A of Foo that allow for more introspection.

Is this sound, or is there maybe a problem it doesn't solve?

@Araq
Copy link
Member

Araq commented May 24, 2023

That just allows you to write macro that is syntactically overlapping with case statement ...

I know you don't appreciate the feature but I don't know why that is. It unifies the potential different pattern matching DSLs and is as beneficial as having a unified syntax for assignment and equality for generic code.

@haxscramper
Copy link

haxscramper commented May 24, 2023

unifies the potential different pattern matching DSLs

It only unifies a single keyword, everything else is up to the macro author to write, so it can be hardly called 'unifying'.

Assignment and equality seem to have different syntax: = vs == and they don't allow the user to put on complex custom DSL in the body.

@haxscramper
Copy link

In order for the unification claim to hold true we must ensure that every DSL implemented with case statement follows the same patter and largely has the same semantics, which more or less implies some standardized approach for this, in which case it does not make any sense to aim for several different macros that are "unified" yet different somehow, as opposed to a single standard solution.

@Araq
Copy link
Member

Araq commented May 24, 2023

Let's assume I have a custom type with two fields that I want to be able to match against an array-like pattern:

type
  Pair = object
    x, y: int
var p: Pair

macro `case`(p: Pair; u: untyped) = ...

case p
of [x, _]:

Now if I happen to use something more complex than int inside Pair I can call the case macro for type T inside my case macro for type Pair. That's the power of a unified interface.

Just like you can lift == for seq[T] as long as T has a ==.

@haxscramper
Copy link

The RFC also provides unified interface -- I would argue that it is even more unified because it does not care about Pair as well, only most general API elements such as

  • [index] and len()
  • contains() and [key]
  • .field
  • .kind etc.

So the multiple macro solution does not provide any win here -- unless I'm expected to write the same boilerplate for macro case(p: SomeType, u: untyped) = ... (implement [x, _]syntax,where syntax and so on), macro case(p: AnotherType, u: untyped) = ... (implement [x, _]syntax,where syntax and so on), macro case(p: ThirdType, u: untyped) = ... (implement [x, _]syntax,where syntax and so on)

which might be unified into a single case -- instead of layering unified implementations on top of each other

@Araq
Copy link
Member

Araq commented May 24, 2023

Cool but case statement macros predate this RFC. So ok, we don't need case statement macros leaving us with the question of whether to use match or case for pattern matching. I vote for case.

@deech
Copy link

deech commented May 28, 2023

Since it was asked earlier, here is an example of Haskell's coverage spanning nested pattern matches. https://old.reddit.com/r/haskell/comments/13tahn8/til_lower_your_guards_in_ghc_9/

@haxscramper
Copy link

haxscramper commented Jun 11, 2023

@Araq for pattern matching, what do you think of the following action plan about finally deciding on how it should be implemented in the language and then proceeding with the implementation. Over the last years (and different implementations) discussion largely revolved around multiple different topics, going into "That doesn't mean much beyond the fact that people avoid dependencies for various reasons." and other things.

I think at this point most of the existing solutions have been tried, so the questions that are left are rather yes/no kind

  1. Would the pattern matching be accepted as a language built-in -- yes/no?
  2. If no, would the pattern matching be accepted as a macro in the standard library along the lines of the std/sugar

Note that "yes" does not mean "yes, any garbage that you write will be merged" but rather "people who are willing to design and implement this feature and receive the feedback on the implementation quality can expect their work to actually move towards the final goal of getting the pattern matching in the stdlib within a reasonable amount of time (post 2.0 or post 2.2 or what you call it, just not years and years of waiting)"

@haxscramper
Copy link

If none of these are affirmative, then the whole discussion can pretty much end there, otherwise we can go to the

  1. List of features that need to be provided for pattern matching and for sum types
  2. Syntax for these features

In this order of importance

@Araq
Copy link
Member

Araq commented Jun 11, 2023

Would the pattern matching be accepted as a language built-in -- yes/no?
If no, would the pattern matching be accepted as a macro in the standard library along the lines of the std/sugar

Yes, in principle pattern matching would be accepted into the language core. It can be done like so:

  • Syntax additions (grammar update, parser update)
  • Spec outlines the rewrite rules so that custom types with custom == operators etc are supported.
  • The rewrite rules can be done via a macro or within the compiler, it doesn't matter much as long as the spec clearly covers which parts of the rules cannot in principle be offered by a macro, if any.

Notice that even though the syntax is the least important part, it's the part that most likely implies changes/additions to the core language since the rest of Nim excels at AST rewrites.

@metagn
Copy link
Contributor

metagn commented Jun 11, 2023

I wasn't going to comment on the pattern matching part of this proposal (counterproposal for ADTs in #527) because it's not something I'm invested in, but since we're talking about adding it to the language I have some issues.

This RFC is fully backwards compatible with Nim.

No it's not. You break this.

proc foo(x: int) =
  const abc = 1
  case x
  of abc: echo "was the constant abc"
  else: echo "was any other number"

foo(1) # was the constant abc
foo(2) # was the constant abc

A rule "any constant in scope will be resolved" here would be bad too, because then when you lose access to the constant or write it with a typo, you use a match all pattern that can cause bugs. Either use match instead of case and require constants to use ^abc or something, or force let abc to bind a value. This is important for all pattern matching proposals.

I think I like oldident: newident for field renaming. It matches object constructors. oldident as newident is an alternative, however. edit: no longer - as is used for if let

For "field renaming" I understand, but for renaming a pattern?

case x
of (a, b): c: ...

There's not really a choice other than as. For if let we can generalize existing infix of which already kind of does pattern matching.

I'm considering allowing guards to be reusable on for blocks a la Swift: with for binding in iterator where condition acting as syntax sugar to the common for binding in iterator: if condition. Is this desired?

I think some people talked about preferring just having for binding in iterator: if condition in a discussion about sugar.collect. For of in case in the existing language it might make sense but then we would need some of ... else: for exhaustiveness checks.

Collection matching

I put the specific issues with the collection matching in this proposal in the following details block because I think there is a more general problem which I will describe after.

# match a set containing any two distinct elements fulfilling a condition
of {x, y, ..} where x + y == 10:

Due to how bitset iteration works, this pattern is O(n^p) where p is the number of variables you used and n is either the size of the set if we are considering the x + y == 10 operation, or the maximum size of the set if we are considering the operation which checks which elements of the type (in this example int8) are in the set. This even applies to just {x, y}.

of [first, second, third .. penultimate, last]:

This makes it seem like third == penultimate when the array has 4 elements, but you describe .. to "denote a range of fields we do not care about matching" which implies you really mean:

of [first, second, third, .., penultimate, last]:

In this case I think it would be better to do:

of [first, second, third, ..middle, penultimate, last]:

Where middle matches every element between third and penultimate, and you can use _ instead to discard them.

(end of details)

The more general problem is this is the exact kind of stuff we shouldn't have in the language and libraries should (and already do). In general I think we should minimize the "conditionality" of the pattern matching and focus on the destructuring syntax so we can get it consistent with the existing language (tuple unpacking matches tuple construction, so we should match named tuple construction, object construction etc). Even the ADTs in this proposal serve this purpose: we have a simpler alternative to the conditionality of object variant kind branches.

I'm not even particularly passionate about destructuring. There's not much ergonomic difference between foo of Rectangle(x, y): x + y and foo of Rectangle: foo.x + foo.y. But I do understand separating it from control flow has benefits (i.e. getting to use lent, not having to deal with things like nim-lang/Nim#22060). The collection matching stuff is related to analysis completely separate from the compiler or core language.


Bonus:

As a result, variant objects additionally introduce overhead from the tag that need not exist. (the most minor of their issues, IMO)

The tag absolutely needs to exist. There is no overhead. The alternative is C unions ({.union.}) which are not memory safe. https://en.wikipedia.org/wiki/Tagged_union

Grain: Success, used by people
Nim: Fail, not widely used by people

???

@omentic
Copy link
Author

omentic commented Jun 12, 2023

Thanks for the detailed feedback! That case can currently refer to existing unrelated bindings is a serious problem and not something I had considered. I wanted to use of x where x == abc to avoid issues with ambiguity in bindings (i.e. don't allow access to prior bindings outside of where statements), but I'll need to think through how to deal with this. My gut reaction is to say "eh, it probably isn't used widely"... but it probably is used widely enough. My second gut reaction is to say "let's change the semantics of case/of around union types in particular": but that goes against what this proposal is striving for, using existing semantics as much as possible. I'm gonna sit on this one and try to puzzle it out.

Note that I don't intend for field renaming to extend to patterns. Right now, this RFC only allows for field renaming to work on fields in objects and named tuple: I don't want it to act as a general "bind" statement because it isn't needed and it adds syntactic complexity. eg. every time you want to bind to an expression, you can just use its inner fields instead. (this is because a lot of stuff is moved to the where statement)

Re: minimizing the conditionality, I think I agree - for a first attempt I want to implement the union type (or whatever we end up calling it) and unpacking via if lets, and save the pattern matching for a follow-up PR. I'm ambiguous as to what form if lets take (as long as it is not if let! i hate rust's syntax there), either using as or of seems fine but using of would beg the question of if it follows existing syntax, and what limitations surround it there. Perhaps less of a concern than the case bindings issue, though.

You're very right about the tag also: I don't know what I was writing there? I think I meant to refer to the overhead of having all cases in a variant object vs. only having to worry about the size of the greatest, though I'm not sure if Nim's object variants are implemented in this fashion (it would be a bit naive if they were). It certainly needs a tag of course.

@Araq
Copy link
Member

Araq commented Jun 12, 2023

Fwiw I personally only want of branch where condition: that I've seen proposed with if and and instead of where too so it wouldn't even require a new keyword. If a where is used, enforce that an else exists so exhaustiveness is kept in a most trivial manner.

For binding of values I would add _ as nameHere to the language. _ is already Nim's wildcard and as is not yet widely used in case statements. Then there is special sugar for x: _ as x that can be simplified to x:

case obj
of Obj(x) where x == 10: echo "match"
of Obj(x: _ as y) where y == 21: echo "different match"
else: echo "no match"

Whether obj supports patttern matching or not can continue to be done with macro case overloading. And that is actually preferable because the used algorithm can actually be quite elaborate, consider how LuaJIT implemented its rewrite rule engine with a custom hashing scheme. (Yes, I still like case statement overloading, I think it was a mild stroke of genius when I came up with it...)

@omentic
Copy link
Author

omentic commented Jun 12, 2023

Also, @metagn a note on exhaustiveness (because I said I'd provide that ages ago :-P).

I generally designed this so that expressions that can contribute to the exhaustivity of a pattern match are done within the of expression: that is, generic variable binding (which is exhaustive by nature, of x), explicit matching on fields of "small" types (of true, false, char, enum, int8, etc), and union variants (of Rectange(x, y), of Square(side), ...). The where statement provides a convenient outlet for "general conditionals" that do not contribute much if anything to exhaustivity: ex. of x where x == 5. I think we could entirely skip analysis of where statements in the compiler for exhaustivity, and don't regard them as an exhaustive match: i.e. of x where x == 5 would have to be followed by a plain of x or an else. This means, of x where x == true, of x where x == false is not exhaustive, but of true, of false is.

(lol @Araq - looks like we're on the same page)

This ideal perhaps falls apart a bit with collection matching as you noted. One, collection types are very hard to be exhaustive on: if you're not working with (short) arrays you need a wildcard .. at some point. Two, yeah, it looks like arbitrary where clauses on bitsets can lead to performance pitfalls (more so than lists or structured types it seems). I would be down to drop support for bitsets and tables entirely or at least table (heh) it for now, though I guess it's still good to keep in the back of our heads when thinking about syntax. Three, wrapping back to one a bit: I found it much harder to walk the line of "let this stuff be in a where clause" when dealing with sequences. Technically you don't need the elements matched by .. to be accessible because you know their start and end index. But that's ugly and gross and so maybe some more custom syntax there is warranted.

Anyway, that's just a bit of a brain dump. I think I agree that ADTs + destructuring are the most important part of this proposal, though I probably would have argued for case statements and basic exhaustivity (w/o seqs, sets, tables) being a necessary companion had you not brought up the existing case conflict.

@Araq
Copy link
Member

Araq commented Jun 12, 2023

This ideal perhaps falls apart a bit with collection matching as you noted.

What's the point of collection matching though? A collection typically holds a variable number of things and the pattern by its nature can only deal with a fixed amount of variables. Sounds like solving a problem for people who used the wrong types to begin with.

@omentic
Copy link
Author

omentic commented Jun 22, 2023

After thinking it over for a while I really like @Araq's value binding syntax. It sidesteps the backwards compatibility issue @metagn brought up nicely while the sugar keeps it clean for the most common use case, inner types of objects / named tuples. However: the second most common case is single types within an ADT: ex.

type JsonValue = union
  Boolean: bool
  Number: int
  String: string
  Array: seq[ref Value]
  Object: Table[string, ref Value]

Matching upon those would then look like this:

case value
of Boolean(_ as x):
  ...
of Number(_ as x):
  ...
of String(_ as x):
  ...
of Array(_ as x):
  ...
of Object(_ as x):
   ...

Which is a bit gross. But I don't know what kind of syntax sugar would work well here and not conflict with existing case semantics.

@omentic omentic changed the title Algebraic data types and structural pattern matching Proper sum types and structural pattern matching Jul 20, 2023
@jmgomez
Copy link

jmgomez commented Sep 30, 2023

Macros like these have issues along the lines "can't work in generics" nim-lang/Nim#20435 -- that is something that has been an issue since day 1 and small number of complaints about this is something that I can only attribute to the low use of the fusion/matching

For those interested, the issue is fixed now.

@simonkrauter
Copy link

Related: #548

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests