Skip to content

Commit

Permalink
Feature/tutorial 9 (#36)
Browse files Browse the repository at this point in the history
* update readme

* wip more memoize

* remove some memoize

* try some more memoize

* wip basic TDD tests for symbolic graph

* wip symbolic graph

* refactor 1

* refactor 2

* wip

* wip

* wip

* wip tutorial Leibniz equality

* wip

* reformat

* fix

* reformat

* finish write-up about Leibniz equality

* wip

* add code for factorial

* wip

* wip
  • Loading branch information
winitzki authored Aug 9, 2024
1 parent f9de32e commit e99f412
Show file tree
Hide file tree
Showing 12 changed files with 866 additions and 239 deletions.
5 changes: 3 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,8 @@ assert(factorial(BigInt(10)) == BigInt(3628800))
- [x] GitHub Actions are used to test with JDK 8, 17, 22.

- [x] Converting Dhall values to Scala values: basic support is complete.
- [x] Standalone executable JAR with command-line arguments for type-checking, evaluating, and exporting Dhall expressions.
- [x] Converting Dhall to Yaml and JSON: complete.

### Experimental features and optimizations

Expand All @@ -180,8 +182,6 @@ assert(factorial(BigInt(10)) == BigInt(3628800))

- [x] Print Dhall values using the standard Dhall syntax.

- [x] Export Dhall values to Yaml for most of the relevant types (numbers, strings, Booleans, lists, records).

## Ideas for future developments

1. Implement automatic type inference for certain solvable cases. Omit type annotations from lambdas and omit
Expand Down Expand Up @@ -228,6 +228,7 @@ for keywords.
To improve parsing performance, the parsing results for some sub-expressions are memoized.
This is implemented via an add-on library `fastparse-memoize`.
See [fastparse-memoise README](fastparse-memoize/README.md) for more information.
That library is published as a separate artifact.

#### Limitations

Expand Down
1 change: 1 addition & 0 deletions how_to_deduplicate_strings.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
java -Xlog:stringdedup*=debug -XX:+UseG1GC -XX:+UseStringDeduplication -jar dhall.jar --file scall-cli/src/test/resources/yaml-perftest/create_yaml.dhall yaml
198 changes: 198 additions & 0 deletions nano-dhall/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
# Specification and implementation of "Nano-Dhall"

The "Nano-Dhall" project illustrates how a (very small) functional programming language can be specified and implemented.
Nano-Dhall is a subset of Dhall limited to:

- Function types `∀(a : t1) → t2` and function values `λ(a : t) → b`.
- Function application: `f x`, grouping to the left, so that `f x y = (f x) y`.
- Built-in type `Natural` with operations `a + b`, `a * b`, and `Natural/fold`
- Built-in type `Text` with the concatenation operation (`x ++ y`).
- Type universes `Type` and `Kind`, so that we have typing judgments `Nat : Type` and `Type : Kind`. (The symbol `Kind` has no type.)

We have omitted records, union types, imports and many other features of Dhall.
But all of those features (except imports) can be encoded in Nano-Dhall using just functions and types.

The syntax of Nano-Dhall is simplified so that there are fewer keywords and fewer equivalent ways of writing the same things.

We kept just two built-in types (`Natural` and `Text`) with some built-in operations, in order to show how such features are implemented.

For clarity, the impementation will be as simple and straightforward as possible.

### Encoding various features in Nano-Dhall

Here we will briefly show examples of Nano-Dhall code that reproduce some of the missing features.

- Void type: `∀(a : Type) → a` (instead of Dhall's `<>`)
- Unit type: `∀(a : Type) → ∀(x : a) → a` instead of Dhall's `{}`
- Unit value: `λ(a : Type) → λ(x : a) → x` instead of Dhall's `{=}`
- "Let"-expression: `(λ(a : t) → b) x` instead of Dhall's `let a : t = x in b`. We will replace "let"-expressions by "lambda"-expressions at parsing time.
- Pair type: `λ(a : Type) → λ(b : Type) → ∀(r : Type) → ∀(k : a → b → r) → r` instead of Dhall's `λ(a : Type) → λ(b : Type) → { _1 : a, _2 : b}`
- Pair value: `λ(a : Type) → λ(b : Type) → λ(r : Type) → λ(k : a → b → r) → r`

### Syntax and parsing into the syntax tree

The syntax of Nano-Dhall is described in the ABNF format.
This is a greatly simplified version of `dhall.abnf`.

The main entry point is `complete-expression`.

```
complete-expression = whsp expression whsp
; Optional whitespace: zero or more whitespace characters.
whsp = *whitespace-chunk
; Non-empty whitespace: one or more whitespace characters.
whsp1 = 1*whitespace-chunk
whitespace-chunk =
" "
/ tab
/ end-of-line
expression =
lambda whsp "(" whsp identifier whsp ":" whsp1 expression whsp ")" whsp arrow whsp expression
/ forall whsp "(" whsp identifier whsp ":" whsp1 expression whsp ")" whsp arrow whsp expression
/ let whsp1 identifier whsp ":" whsp1 expression whsp "=" whsp expression whsp1 "in" whsp1 expression
/ expression0
lambda = %x3BB / "\"
arrow = %x2192 / "->"
forall = "forall" / %x2200
keyword = forall / "let" / "in" ; We have just these keywords, for now.
; Precedence tower.
expression0 = expression1 [ whsp ":" whsp1 expression ] ; Type annotation.
expression1 = expression2 *(whsp "+" whsp expression2) ; Nat + Nat
expression2 = expression3 *(whsp "++" whsp expression3) ; Text ++ Text
expression3 = expression4 *(whsp "*" whsp expression4) ; Nat * Nat
expression4 = expression5 *(whsp1 expression5) ; Function application.
; End of precedence tower.
expression5 = primitive-expression
primitive-expression = natural-literal / text-literal / builtin / identifier / "(" complete-expression ")"
natural-literal = NONZERODIGIT *DIGIT / "0"
builtin = "Natural/fold" / "Natural" / "Text" / "Type" / "Kind"
identifier = builtin 1*label-next-char / !builtin label
label =
keyword 1*label-next-char
/ !keyword (label-first-char *label-next-char)
label-first-char = ALPHA / "_"
label-next-char = ALPHANUM / "-" / "/" / "_"
; Uppercase or lowercase ASCII letter.
ALPHA = %x41-5A / %x61-7A
; ASCII digit
DIGIT = %x30-39 ; 0-9
NONZERODIGIT = %x31-39
ALPHANUM = ALPHA / DIGIT
text-literal = %x22 *inside-double-quote %x22
; Printable characters except double quote and backslash.
inside-double-quote =
%x20-21
; %x22 = '"'
/ %x23-5B
; %x5C = "\"
/ %x5D-7F
/ valid-non-ascii
valid-non-ascii = ; Let us just support some Unicode.
%x80-D7FF
; %xD800-DFFF = surrogate pairs
/ %xE000-FFFD
; %xFFFE-FFFF = non-characters
/ %x10000-1FFFD
```

The result of parsing is a syntax tree. The data type for the tree could look like this (Haskell) code:

```haskell
type Ident = String
type Operator = Plus | Concat | Times
data Expr = Lambda Ident Expr Expr -- λ(ident : expr1) → expr2
| Forall Ident Expr Expr -- ∀(ident : expr1) → expr2
| Annot Expr Expr -- expr1 : expr2
| Op Expr Operator Expr -- expr1 operator expr2
| App Expr Expr -- expr1 expr2
| NaturalLit Int -- 123
| TextLit String -- "abc"
| Var Ident Int -- A variable with a de Bruijn index. In Nano-Dhall, `x@1`.
| Builtin String -- One of the built-in symbols (`Natural`, `Type`, etc.).
```

Remarks:

- Types and values are both represented by an expression (`Expr`).
Later, the type-checking procedure will figure out which is which.
This allows us to write code like `λ(a : Type) λ(x : a) ...`

- Multiple-arity operations are parsed into left-associative, nested binary operations.
For example, `1 + 2 + 3` is parsed into the following expression tree:

```haskell
Op (Op (NaturalLit 1) Plus (NaturalLit 2)) Plus (NaturalLit 3)
```

- All "let"-expressions are immediately rewritten as `App` terms.

- All built-in operations are associative (as in Dhall).

- Nano-Dhall does not support explicit de Bruijn indices in the syntax; but internally, all variables must have a de Bruijn index.
Parsing `x` will produce `Var "x" 0`, with the de Bruijn index equal to `0`.

Any parsing library can be used to implement parsing.
The parsing procedure uses negative lookahead in some places, to prevent variable names from being keywords or built-in symbols.

### Type checking

Type checking in Nano-Dhall is a function from an input pair `(context, expression)` to an output .

The input `context` is a set of `(variable, expression)` pairs, showing all the variables whose types are already known.
The input `expression` is a Nano-Dhall expression with or without existing type annotations.

The output of typechecking is either a success with an output `expression` or a failure with a message describing a type error.
In case of success, the output `expression` has a type annotation indicating the type of the expression.

For example, the typechecking of `(context = [y : Natural], λ(x : Natural) → x + 1) ` succeeds and outputs:

`(λ(x : Natural) → x + 1) : ∀(x : Natural) → Natural`

This typechecking function is different from what is described in the Dhall standard.
There, typechecking does not modify the original expression but merely computes its type (or determines that there is a type error).

Making the typechecker return a different expression opens further possibilities for improving the language's usability and tooling:

- One could return partial type-checking results (with type annotations at valid sub-expressions) even when there are type errors in the input.
- The evaluator can do more simplification if type information is embedded into the expression being evaluated. Currently, Dhall's evaluator (beta-normalization) is designed to work entirely without any type information. This limits its ability to perform certain simplifications that are possible only if types of certain sub-expressions are known.
- Type inference and/or term inference can be performed at the type-checking stage. The output of type checking can be an expression not only with additional type annotations, but also with some additional terms; for instance, automatically inferred type parameters or other inferrable terms.

The type-checking algorithm proceeds recursively through the structure of the given expression.
When a sub-expression is a function or a function type, a new variable is bound.
So, the type-checking for the function body will proceed with the context that contains the new variable.
To avoid name clashes, the body needs to be modified if it already contains a variable with the same name.



### Alpha-normalization

### Beta-normalization

### One-place type inference



Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ trait HashDict[A] {
def store(value: A): Int
}

class LRUHashDict[A](maxSize: Int) extends HashDict[A] {
class ConcurrentHashDict[A](maxSize: Int) extends HashDict[A] {
private val valueDict: ConcurrentMap[Int, A] = new ConcurrentHashMap[Int, A]
private val keyDict: ConcurrentMap[A, Int] = new ConcurrentHashMap[A, Int]

Expand All @@ -21,5 +21,3 @@ class LRUHashDict[A](maxSize: Int) extends HashDict[A] {
key
}
}

object LRUHashDict {}
130 changes: 130 additions & 0 deletions nano-dhall/src/main/scala/io/chymyst/nanodhall/NanoDhallGrammar.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
package io.chymyst.nanodhall

import fastparse.NoWhitespace._
import fastparse._

import scala.util.{Failure, Success, Try}
import io.chymyst.fastparse.Memoize.MemoizeParser

object NanoDhallGrammar {

def ALPHA[$: P] = P(
CharIn("\u0041-\u005A", "\u0061-\u007A") // A_Z | a_z
)

def DIGIT[$: P] = P(
CharIn("0-9")
// 0_9
)

def end_of_line[$: P] = P("\n" | "\r\n")

def valid_non_ascii[$: P] = P(
CharIn(
"\u0080-\uD7FF",
// %xD800_DFFF = surrogate pairs
// "\uE000-\uFFFC",
"\uE000-\uFFFC", // Workaround: Disallow the "replacement" character ("\uFFFD") because it will be generated for invalid utf-8 encodings.
)
// Encode other Unicode ranges into Java's UTF-16 using UTF-16 surrogates.
// See https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=10000&mode=hex and look for "UTF-16 surrogates".
// %xFFFE_FFFF = non_characters
// | % x10000_1FFFD
// U+10000 = "\uD800\uDC00"
// U+103FF = "\uD800\uDFFF"
// U+10400 = "\uD801\uDC00"
// U+1FFFD = "\uD83F\uDFFD"
// format: off
| (CharIn("\uD800-\uD83E") ~ CharIn("\uDC00-\uDFFF"))
// format: on
| (CharIn("\uD83F") ~ CharIn("\uDC00-\uDFFD"))
// // %x1FFFE_1FFFF = non_characters
// | % x20000_2FFFD // U+20000 = \uD840\uDC00
| (CharIn("\uD840-\uD87E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD87F") ~ CharIn("\uDC00-\uDFFD"))
// // %x2FFFE_2FFFF = non_characters
// | % x30000_3FFFD
| (CharIn("\uD880-\uD8BE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD8BF") ~ CharIn("\uDC00-\uDFFD"))
// // %x3FFFE_3FFFF = non_characters
// | % x40000_4FFFD
| (CharIn("\uD8C0-\uD8FE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD8FF") ~ CharIn("\uDC00-\uDFFD"))
// // %x4FFFE_4FFFF = non_characters
// | % x50000_5FFFD
| (CharIn("\uD900-\uD93E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD93F") ~ CharIn("\uDC00-\uDFFD"))
// // %x5FFFE_5FFFF = non_characters
// | % x60000_6FFFD
| (CharIn("\uD940-\uD97E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD97F") ~ CharIn("\uDC00-\uDFFD"))
// // %x6FFFE_6FFFF = non_characters
// | % x70000_7FFFD
| (CharIn("\uD980-\uD9BE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD9BF") ~ CharIn("\uDC00-\uDFFD"))
// // %x7FFFE_7FFFF = non_characters
// | % x80000_8FFFD
| (CharIn("\uD9C0-\uD9FE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uD9FF") ~ CharIn("\uDC00-\uDFFD"))
// // %x8FFFE_8FFFF = non_characters
// | % x90000_9FFFD
| (CharIn("\uDA00-\uDA3E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDA3F") ~ CharIn("\uDC00-\uDFFD"))
// // %x9FFFE_9FFFF = non_characters
// | % xA0000_AFFFD
| (CharIn("\uDA40-\uDA7E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDA7F") ~ CharIn("\uDC00-\uDFFD"))
// // %xAFFFE_AFFFF = non_characters
// | % xB0000_BFFFD
| (CharIn("\uDA80-\uDABE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDABF") ~ CharIn("\uDC00-\uDFFD"))
// // %xBFFFE_BFFFF = non_characters
// | % xC0000_CFFFD
| (CharIn("\uDAC0-\uDAFE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDAFF") ~ CharIn("\uDC00-\uDFFD"))
// // %xCFFFE_CFFFF = non_characters
// | % xD0000_DFFFD
| (CharIn("\uDB00-\uDB3E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDB3F") ~ CharIn("\uDC00-\uDFFD"))
// // %xDFFFE_DFFFF = non_characters
// | % xE0000_EFFFD
| (CharIn("\uDB40-\uDB7E") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDB7F") ~ CharIn("\uDC00-\uDFFD"))
// // %xEFFFE_EFFFF = non_characters
// | % xF0000_FFFFD
| (CharIn("\uDB80-\uDBBE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDBBF") ~ CharIn("\uDC00-\uDFFD"))
// U+F0000 = "\uDB80\uDC00"
// U+FFFFD = "\uDBBF\uDFFD"
// // %xFFFFE_FFFFF = non_characters
// | % x100000_10FFFD
| (CharIn("\uDBC0-\uDBFE") ~ CharIn("\uDC00-\uDFFF"))
| (CharIn("\uDBFF") ~ CharIn("\uDC00-\uDFFD"))
// U+100000 = "\uDBC0\uDC00"
// U+10FFFD = "\uDBFF\uDFFD"
// %x10FFFE_10FFFF = non_characters
)
.memoize

def tab[$: P] = P("\t")

def whitespace_chunk[$: P] = P(
" "
| tab
| end_of_line
)
.memoize

def whsp[$: P]: P[Unit] = P(
NoCut(whitespace_chunk.rep)
)

def whsp1[$: P]: P[Unit] = P(
NoCut(whitespace_chunk.rep(1))
)

def ALPHANUM[$: P] = P(
ALPHA | DIGIT
)

}
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
package io.chymyst.nanodhall.unit

import com.eed3si9n.expecty.Expecty.expect
import io.chymyst.dhall.LRUHashDict
import io.chymyst.dhall.ConcurrentHashDict
import munit.FunSuite

class LRUHashDictTest extends FunSuite {
class ConcurrentHashDictTest extends FunSuite {
test("hash strings") {
val dict = new LRUHashDict[String](10)
val dict = new ConcurrentHashDict[String](10)
val abc = dict.store("abc")
val cde = dict.store("cde")
expect(abc != cde)
Expand Down
Loading

0 comments on commit e99f412

Please sign in to comment.