Skip to content

Latest commit

 

History

History
171 lines (123 loc) · 11.4 KB

UIP-0107.md

File metadata and controls

171 lines (123 loc) · 11.4 KB
uip title description author status type category created
71
Aura Renovation
Address inconsistencies in the aura type system
~ponmep-litsem
Draft
Standards Track
Hoon
2023-06-30

Abstract

The aura type system provides a way to indicate a particular interperation of an atom in Hoon. An aura of an atom influences both how it is presented to the user, as well as specifies its parsing routine. Currently, auras are used in an inconsistent manner, thus preventing enforcement of atom sanity at the compiler level. We address these inconsistencies together with related problems. With the implementation of this proposal auras become a correct type system tool, paving the way for compile-time atom sanity checking.

Motivation

There are several ways in which auras are used in inconsistent manner.

  1. The aura type itself, defined as @ta, violates the constraint on knots, which do not permit uppercase letters.

  2. Use of term type is too restrictive for an aura variable, which admits uppercase bit-width indicator. An example of this inconsistency is in the Hoon atom definition,

    +$  hoon
    ...
    [%rock p=term q=*]
    [%sand p=term q=*]
    

as well as in many functions throughout sys/hoon.hoon which accept term for what is an aura. This is seen in practice with tapes, defined as (list @tD). Another example is +fitz which accepts a term for a general aura.

  1. There are two Hoon literals which have uppercase letters in their digit set: @uc and @uw. The scot atom printer returns a @ta, and the slaw family of atom parsers similarly violates this same constraint, accepting a @ta as their string argument.

  2. The +slaw family of atom parsers accepts a @tas for an aura, thus restricting valid auras with bit-width indicator.

  3. Clay persists invalid paths according to the path specification as a list of @ta. These can either be some valid @uc, @uw atoms, or, more generally, any path addressing a file whose name contains uppercase letters.

These violations of the aura type system make it impossible to enforce sanity of atoms at compile-time, as originally envisioned with the +sane and +ruth arms and had already caused problems in %clay handling of paths and desks. Fixing the aura type system, as proposed here, is a required step before enabling compile-time atom sanity checks in the future.

Apart from the inconsistent use of the auras, there are also two related problems:

  • (A) +sane, so far not used at the compiler level, is under-specified and performs only some atom sanity checks
  • (B) Default hexadecimal printing of user-defined auras could become a source of unforseen bugs should this behaviour be changed in the future

Specification

  • The knot aura @ta denotes a string composed of allowed characters: [0-9], [a-z], [A-Z], '_', '-', '~', '.'. It is considered URL-safe, and allows the encoding of all Hoon constants.
  • Aura syntax. A valid aura consists of three parts prefixed by @: the major part, the bit-width (optional), and the minor part (optional). The major part is matched by [a-z]*. The bit-width is matched by [A-Z]. The minor part is matched by -[a-z]+. Examples of valid auras include @u, @uD, @uv-blob, @uxG-hash.
  • Aura nesting rules. Aura B nests in aura A if the following conditions are all satisfied. (1) Nesting in the major part. B nests in A if the major part of B is an extension of the major part of A. (2) Nesting in the bit-width. B nests in A if the bit-width of B is not greater than the bit-width of A. Absence of bit-width indicates unlimited size. (3) Nesting in the minor part. B nests in A if the minor part of B matches the minor part of A, or if A is the universal aura @.
  • Universal aura @. All atoms nest under the universal aura @. This is a consequence of specified nesting rules.
  • Aura compatibility. Auras A and B are said to be compatible if and only if A nests in B or B nests in A.
  • +fitz is defined to be an aura nest check.
  • The @c aura denotes a well-formed UTF-32 code unit sequence encoded in UTF-32LE encoding form, according to Definitions 88 and 90 of the Unicode standard. Detailed specification is found in the specification supplement.
  • @t aura denotes a well-formed UTF-8 code unit sequence encoded in UTF-8 encoding form, according to Definitions 86 and 92 of the Unicode standard. In addition, @t excludes the control code points 0x00..0x09 and 0x0B..0x1F, but includes the line feed code point 0x0A. Detailed specification is found in the specification supplement.
  • Conversion from a @c atom to a @t atom. Since a @c atom can assume values that represent control code points invalid for a @t atom, the conversion routine shall prevent creation of invalid @t atoms by crashing on invalid inputs.
  • The pretty-printer renders user-defined auras according to their nearest defined nest.

Rationale

@ta to allow uppercase letters

The exclusion of uppercase letters from the knot aura @ta specification seems unnecessary. Relaxation of this condition alone solves problems 1–5. There seems to be little justification for it at present, beyond enforcing strict URL safety. It is clear that such strictness can no longer be maintained.

Hoon literals can not be considered URL-safe unless uppercase letters are permitted. Overwhelming majority of websites is served by UNIX systems, which are naturally case sensitive. Clay too has to allow uppercase filenames.

We therefore propose to relax this restriction and allow uppercase letters in @ta, declaring it to be URL-safe.

New aura syntax and nesting rules

Major part nests only by extension.

The nesting rules for the major part are modified. Previously, the nesting rules in the major part functioned as compatibility check, as realized by the +fitz arm. This enables the programmer to produce atoms violating the sanity check.

> (fitz %ta %tas)
> %.y
> (fitz %tas %ta)
> %.y
> ^-(@tas 'MAD')
%MAD

Meawhile, in the bit-width +fitz functioned as a nest check proper.

> (fitz 'uE' 'uD')
> %.y
> (fitz 'uD' 'uE')
> %.n

We restore consistency by redefining +fitz as a one-directional aura nest check.

Bit-width syntax with alternative nesting behaviour

Currently, only the last uppercase letter is treated as a bit-width extension by the aura compatibility check arm +fitz. This means that an aura like @uDE denotes an aura of type @uD with the bit-width extension of E -- 16 bits. We propose to simplify things by further restricting the bit-width syntax: only a single, trailing uppercase letter is allowed in the aura specification. Thus an aura @uDE is invalid, while @uD and @uE are allowed.

Minor parts nest by exact match

The minor part of an aura allows us to introduce more descriptive aura types, to state the indented purpose of the aura, in addition to its semantics. For instance, @uxD-flag could be used when communicating with hardware devices, while @uv-blob could be used in place of the less readable @uvblob. The minor parts nest based on exact match, as opposed to major parts, which nest by extension. In particular, @uv-blob does not nest under @uv, but it does nest under @-blob. As a special case, the universal aura @ is considered to encompass auras, regardless of the minor part.

Example parser of the new syntax

An example implementation of the new +mota to be introduced by this proposal is presented below.

|=  a=tape
=<
(scan a mota)
|%
++  mota  ;~(pfix pat ;~(plug major (punt bit-width) (punt minor)))
++  major  (star low)
++  bit-width  hig
++  minor  ;~(plug hep (plus low))
--

Change in default printing of atoms

The default printing of user-defined auras as hexadecimal number without separators tempts the programmer to exploit this behaviour as a convenient way to print hexadecimal atoms. It is also inconsistent with the rendering of the empty aura @ as an unsigned decimal atom. Rather, user-defined auras should render in the same way as the @ aura, under which they nest.

Specification supplement

@c atom specification

In UTF-32LE encoding form, each Unicode scalar value is assigned a 32-bit number with the same value as the Unicode scalar value. This number is then encoded in a Hoon atom in little-endian byte order. A @c atom is a sequence of Unicode scalar values encoded in this manner.

The maximum Unicode scalar value is 0x10FFFF. Since surrogate code points are not in the set of Unicode scalar values, the code points range 0xD800..0xDFFF is excluded.

@t atom specification

In UTF-8 encoding form, each Unicode scalar value is assigned a sequence of one to four bytes, following Definition 92 of the Unicode Standard. A @t atom is a sequence of bytes resulting from encoding a sequence of Unicode scalar values in this manner.

Surrogate code points are likewise excluded. In addition, the control code points 0x00..0x09 and 0x0B..0x1F are excluded, but the line feed code point 0x0A is included.

The mapping between Unicode scalar values and UTF-8 byte sequences is as follows.

UTF-8 encoding of code points

Scalar Value First Byte Second Byte Third Byte Fourth Byte
0000.0000 0xxx.xxxx 0xxxxxxx
0000.0yyy yyxx.xxxx 110yyyyy 10xxxxxx
zzzz.yyyy yyxx.xxxx 1110zzzz 10yyyyyy 10xxxxxx
000u.uuuu zzzz.yyyy yyxx.xxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

Well-formed UTF-8 byte sequences

Code Points First Byte Second Byte Third Byte Fourth Byte
U+000A, U+0020..U+007F 0A, 20..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 ==A0==..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..==9F== 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+1.0000..U+3.FFFF F0 ==90==..BF 80..BF 80..BF
U+4.0000..U+F.FFFF F1..F3 80..BF 80..BF 80..BF
U+10.0000..U+10.FFFF
F4 80..==8F== 80..BF 80..BF

In the above table, byte ranges for the second byte that are different from the usual range 80..BF are highlighted. The structure of above table is a consequence of imposed code point range restrictions, as well as of mapping bits of a particular code point byte to resulting encoded bytes in such a way that the former are spread across byte boundaries.

Backward Compatibility

  • The stricter aura syntax, with only single trailing uppercase bit-width annotation allowed, is currently obeyed across all standard desks.
  • The change in +fitz is not backward compatible. Code relying on casting to a compatible, but not nesting aura will crash.
  • The redefined @ta aura will cover a wider range of strings and is therefore backward-compatible.
  • With the change of default printing of user-defined auras, any code relying on it will malfunction.

Acknowledgements

~rovnys-ricfer proposed extending the aura syntax with minor part.

Copyright

Copyright and related rights waived via CC0.