uip | title | description | author | status | type | category | created |
---|---|---|---|---|---|---|---|
71 |
Aura Renovation |
Address inconsistencies in the aura type system |
~ponmep-litsem |
Draft |
Standards Track |
Hoon |
2023-06-30 |
The aura type system provides a way to indicate a particular interperation of an atom in Hoon. An aura of an atom influences both how it is presented to the user, as well as specifies its parsing routine. Currently, auras are used in an inconsistent manner, thus preventing enforcement of atom sanity at the compiler level. We address these inconsistencies together with related problems. With the implementation of this proposal auras become a correct type system tool, paving the way for compile-time atom sanity checking.
There are several ways in which auras are used in inconsistent manner.
-
The
aura
type itself, defined as@ta
, violates the constraint on knots, which do not permit uppercase letters. -
Use of
term
type is too restrictive for an aura variable, which admits uppercase bit-width indicator. An example of this inconsistency is in the Hoon atom definition,+$ hoon ... [%rock p=term q=*] [%sand p=term q=*]
as well as in many functions throughout sys/hoon.hoon
which accept term
for what is an aura
. This is seen in practice with tapes, defined as (list @tD)
. Another example is +fitz
which accepts a term
for a general aura.
-
There are two Hoon literals which have uppercase letters in their digit set:
@uc
and@uw
. Thescot
atom printer returns a@ta
, and theslaw
family of atom parsers similarly violates this same constraint, accepting a@ta
as their string argument. -
The
+slaw
family of atom parsers accepts a@tas
for an aura, thus restricting valid auras with bit-width indicator. -
Clay persists invalid paths according to the path specification as a list of
@ta
. These can either be some valid@uc
,@uw
atoms, or, more generally, any path addressing a file whose name contains uppercase letters.
These violations of the aura type system make it impossible to enforce sanity of atoms at compile-time, as originally envisioned with the +sane
and +ruth
arms and had already caused problems in %clay
handling of paths and desks. Fixing the aura type system, as proposed here, is a required step before enabling compile-time
atom sanity checks in the future.
Apart from the inconsistent use of the auras, there are also two related problems:
- (A)
+sane
, so far not used at the compiler level, is under-specified and performs only some atom sanity checks - (B) Default hexadecimal printing of user-defined auras could become a source of unforseen bugs should this behaviour be changed in the future
- The knot aura
@ta
denotes a string composed of allowed characters:[0-9], [a-z], [A-Z], '_', '-', '~', '.'
. It is considered URL-safe, and allows the encoding of all Hoon constants. - Aura syntax. A valid aura consists of three parts prefixed by
@
: the major part, the bit-width (optional), and the minor part (optional). The major part is matched by[a-z]*
. The bit-width is matched by[A-Z]
. The minor part is matched by-[a-z]+
. Examples of valid auras include@u
,@uD
,@uv-blob
,@uxG-hash
. - Aura nesting rules. Aura B nests in aura A if the following conditions are all satisfied. (1) Nesting in the major part. B nests in A if the major part of B is an extension of the major part of A. (2) Nesting in the bit-width. B nests in A if the bit-width of B is not greater than the bit-width of A. Absence of bit-width indicates unlimited size. (3) Nesting in the minor part. B nests in A if the minor part of B matches the minor part of A, or if A is the universal aura
@
. - Universal aura
@
. All atoms nest under the universal aura@
. This is a consequence of specified nesting rules. - Aura compatibility. Auras
A
andB
are said to be compatible if and only if A nests in B or B nests in A. +fitz
is defined to be an aura nest check.- The
@c
aura denotes a well-formed UTF-32 code unit sequence encoded in UTF-32LE encoding form, according to Definitions 88 and 90 of the Unicode standard. Detailed specification is found in the specification supplement. @t
aura denotes a well-formed UTF-8 code unit sequence encoded in UTF-8 encoding form, according to Definitions 86 and 92 of the Unicode standard. In addition,@t
excludes the control code points0x00..0x09
and0x0B..0x1F
, but includes the line feed code point0x0A
. Detailed specification is found in the specification supplement.- Conversion from a @c atom to a @t atom. Since a @c atom can assume values that represent control code points invalid for a @t atom, the conversion routine shall prevent creation of invalid @t atoms by crashing on invalid inputs.
- The pretty-printer renders user-defined auras according to their nearest defined nest.
The exclusion of uppercase letters from the knot aura @ta
specification seems unnecessary. Relaxation of this condition alone solves problems 1–5. There seems to be little justification for it at present, beyond enforcing strict URL safety. It is clear that such strictness can no longer be maintained.
Hoon literals can not be considered URL-safe unless uppercase letters are permitted. Overwhelming majority of websites is served by UNIX systems, which are naturally case sensitive. Clay too has to allow uppercase filenames.
We therefore propose to relax this restriction and allow uppercase letters in @ta
, declaring it to be URL-safe.
The nesting rules for the major part are modified. Previously, the nesting rules in the major part functioned as compatibility check, as realized by the +fitz
arm. This enables the programmer to produce atoms violating the sanity check.
> (fitz %ta %tas)
> %.y
> (fitz %tas %ta)
> %.y
> ^-(@tas 'MAD')
%MAD
Meawhile, in the bit-width +fitz
functioned as a nest check proper.
> (fitz 'uE' 'uD')
> %.y
> (fitz 'uD' 'uE')
> %.n
We restore consistency by redefining +fitz
as a one-directional aura nest check.
Currently, only the last uppercase letter is treated as a bit-width extension by the aura compatibility check arm +fitz
. This means that an aura like @uDE
denotes an aura of type @uD
with the bit-width extension of E
-- 16 bits. We propose to simplify things by further restricting the bit-width syntax: only a single, trailing uppercase letter is allowed in the aura specification. Thus an aura @uDE
is invalid, while @uD
and @uE
are allowed.
The minor part of an aura allows us to introduce more descriptive aura types, to state the indented purpose of the aura, in addition to its semantics. For instance, @uxD-flag
could be used when communicating with hardware devices, while @uv-blob
could be used in place of the less readable @uvblob
. The minor parts nest based on exact match, as opposed to major parts, which nest by extension. In particular, @uv-blob
does not nest under @uv
, but it does nest under @-blob
. As a special case, the universal aura @
is considered to encompass auras, regardless of the minor part.
An example implementation of the new +mota
to be introduced by this proposal is presented below.
|= a=tape
=<
(scan a mota)
|%
++ mota ;~(pfix pat ;~(plug major (punt bit-width) (punt minor)))
++ major (star low)
++ bit-width hig
++ minor ;~(plug hep (plus low))
--
The default printing of user-defined auras as hexadecimal number without separators tempts the programmer to exploit this behaviour as a convenient way to print hexadecimal atoms. It is also inconsistent with the rendering of the empty aura @
as an unsigned decimal atom. Rather, user-defined auras should render in the same way as the @
aura, under which they nest.
In UTF-32LE encoding form, each Unicode scalar value is assigned a 32-bit number with the same value as the Unicode scalar value. This number is then encoded in a Hoon atom in little-endian byte order. A @c atom is a sequence of Unicode scalar values encoded in this manner.
The maximum Unicode scalar value is 0x10FFFF
. Since surrogate code points are not in the set of Unicode scalar values, the code points range 0xD800..0xDFFF
is excluded.
In UTF-8 encoding form, each Unicode scalar value is assigned a sequence of one to four bytes, following Definition 92 of the Unicode Standard. A @t atom is a sequence of bytes resulting from encoding a sequence of Unicode scalar values in this manner.
Surrogate code points are likewise excluded. In addition, the control code points 0x00..0x09
and 0x0B..0x1F
are excluded, but the line feed code point 0x0A
is included.
The mapping between Unicode scalar values and UTF-8 byte sequences is as follows.
Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
---|---|---|---|---|
0000.0000 0xxx.xxxx | 0xxxxxxx | |||
0000.0yyy yyxx.xxxx | 110yyyyy | 10xxxxxx | ||
zzzz.yyyy yyxx.xxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
000u.uuuu zzzz.yyyy yyxx.xxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
---|---|---|---|---|
U+000A, U+0020..U+007F | 0A, 20..7F | |||
U+0080..U+07FF | C2..DF | 80..BF | ||
U+0800..U+0FFF | E0 | ==A0==..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..==9F== | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+1.0000..U+3.FFFF | F0 | ==90==..BF | 80..BF | 80..BF |
U+4.0000..U+F.FFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+10.0000..U+10.FFFF |
F4 | 80..==8F== | 80..BF | 80..BF |
In the above table, byte ranges for the second byte that are different from the usual range 80..BF
are highlighted. The structure of above table is a consequence of imposed code point range restrictions, as well as of mapping bits of a particular code point byte to resulting encoded bytes in such a way that the former are spread across byte boundaries.
- The stricter aura syntax, with only single trailing uppercase bit-width annotation allowed, is currently obeyed across all standard desks.
- The change in
+fitz
is not backward compatible. Code relying on casting to a compatible, but not nesting aura will crash. - The redefined
@ta
aura will cover a wider range of strings and is therefore backward-compatible. - With the change of default printing of user-defined auras, any code relying on it will malfunction.
~rovnys-ricfer proposed extending the aura syntax with minor part.
Copyright and related rights waived via CC0.