.tm 6. Ion Strings & Clobs ............................................ \n%
.ti 0
6. Ion Strings & Clobs
This document clarifies the semantics of the Amazon Ion string and clob
data types with respect to escapes and the [Unicode][2] standard.
As of this writing, the Unicode Standard is at
[version 10.0][3]. This specification is written against that version.
.tm 6.1 Unicode Primer ................................................ \n%
.ti 0
6.1. Unicode Primer
The Unicode standard specifies a large set of code points, the Universal
Character Set (UCS). Each code point is an integer in the range of 0 (0x0)
through 1,114,111 (0x10FFFF) inclusive. Throughout this document, the
notations U+HHHH and U+HHHHHHHH refer to the Unicode code points HHHH and
HHHHHHHH respectively as hexadecimal ordinals. This notation follows the
Unicode standard convention.
Traditionally, from a programmer's perspective, a code point can be thought
of as a character, but there is sometimes a subtle distinction. For
example, in Java, the char type is an unsigned, 16-bit integer, which is
normally used to hold UTF-16 code units (e.g. [java.lang.CharSequence][4]).
The Unicode code point Mathematical Bold Capital "A" (code point
U+0001D400) is encoded in a UTF-16 string as two code units: 0xD835
followed by 0xDC00. So in this case, Java's UTF-16 representation actually
uses two character (i.e. char) values to represent one Unicode code point.
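The following Python sketch is illustrative only and is not part of this
specification. It lists the UTF-16 code units of a single code point using
the standard "utf-16-be" codec; the helper name utf16_code_units is
hypothetical.

```python
import struct


def utf16_code_units(code_point):
    """Return the UTF-16 code units for one Unicode code point."""
    data = chr(code_point).encode("utf-16-be")
    # Each code unit occupies two octets; unpack them as unsigned 16-bit values.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# Mathematical Bold Capital "A" needs two code units, as described above.
units = utf16_code_units(0x1D400)
print([hex(u) for u in units])  # ['0xd835', '0xdc00']
```

A code point in the range U+0000 to U+FFFF (excluding surrogates) yields a
single code unit from the same helper.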
This document attempts to avoid using the term character when referring to
Unicode code points. The reasoning for this is partly stated above, but
also has to do with the overloaded nature of the term (e.g. a user
character or grapheme). For more details, consult section [3.4 of the
Unicode Standard][5].
Another interesting aspect of the UCS is a block of code points that is
reserved exclusively for use in the UTF-16 encoding (i.e. surrogate code
points). As such, strictly speaking, no encoding of Unicode is allowed to
represent the code points in the inclusive range U+D800 to U+DFFF. In the
UTF-16 case, these values are only allowed to be used in the encoding, as
pairs of code units, to specify code points in the U+00010000 to U+0010FFFF
range. Refer to sections [3.8 and 3.9 of the Unicode Standard][5] for
details.
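As an illustration only, the Python sketch below checks whether a code
point is a surrogate and shows that Python's built-in strict codecs refuse
to encode one. The helper names is_surrogate and can_encode are
hypothetical.

```python
def is_surrogate(code_point):
    """True for code points in the inclusive range U+D800 to U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def can_encode(code_point, encoding):
    """True if Python's strict codec accepts the code point in this encoding."""
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x0041, 0xD800, 0xDFFF, 0x10000):
    print("U+%04X" % cp, is_surrogate(cp), can_encode(cp, "utf-8"))
```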
.tm 6.2 Ion String .................................................... \n%
.ti 0
6.2. Ion String
The Ion String data type is a sequence of Unicode code points. The Ion
semantics of this are agnostic to any particular Unicode encoding (e.g.
UTF-16, UTF-8), except for the concrete syntax specification of the Ion
binary and text formats.
.tm _ 6.2.1. Text Format ......................................... \n%
.ti 0
6.2.1. Text Format
See the [grammar][6] for a formal definition of the Ion Text encoding for
the string type.
Multiple Ion long string literals that are separated only by zero or more
whitespace characters are concatenated automatically. For example, the
following two lines of Ion text syntax are semantically equivalent. Note
that short string literals do not exhibit this behavior.
.nf
"1234" '''Hello''' '''World'''
"1234" "HelloWorld"
.in 3
Each individual long string literal must be a valid Unicode character
sequence when unescaped. The following examples are invalid: the first two
split a Unicode escape across literals, the third splits an escaped
surrogate pair, and the fourth splits a common escape.
.nf
'''\u''' '''1234'''
'''\U0000''' '''1234'''
'''\uD800''' '''\uDC00'''
'''\''' '''n'''
.in 3
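The rule above means that each long string body is unescaped on its own
before the results are joined. The Python sketch below is a simplified
illustration, not a conforming parser: it assumes the literal bodies have
already been extracted from between their triple quotes, it handles only
the line feed escape and the 4-digit Unicode escape, and it adopts the
strict choice of rejecting surrogate escapes. The names unescape_segment
and concat_long_strings are hypothetical.

```python
import re

# Only the escapes needed for this illustration: \n and \uHHHH.
_ESCAPE = re.compile(r"\\(?:n|u[0-9A-Fa-f]{4})")


def unescape_segment(body):
    """Unescape one long string body; reject incomplete or surrogate escapes."""
    out = []
    pos = 0
    while pos < len(body):
        if body[pos] != "\\":
            out.append(body[pos])
            pos += 1
            continue
        match = _ESCAPE.match(body, pos)
        if match is None:
            raise ValueError("incomplete or unsupported escape at offset %d" % pos)
        text = match.group(0)
        if text == "\\n":
            out.append("\n")
        else:
            code_point = int(text[2:], 16)
            if 0xD800 <= code_point <= 0xDFFF:
                raise ValueError("surrogate escape U+%04X" % code_point)
            out.append(chr(code_point))
        pos = match.end()
    return "".join(out)


def concat_long_strings(bodies):
    """Unescape every body independently, then concatenate the results."""
    return "".join(unescape_segment(body) for body in bodies)
```

With this sketch, the bodies Hello and World join to HelloWorld, while
each of the invalid examples above raises ValueError because at least one
body fails on its own.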
Within long string literals, unescaped newlines are normalized such that
U+000D U+000A pairs (CARRIAGE RETURN and LINE FEED respectively) and lone
U+000D code points are replaced with U+000A. This is to facilitate
compatibility across operating systems.
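A minimal Python sketch of this normalization follows. It is illustrative
only and assumes it is applied to the raw body of a long string literal
before any escape sequences are processed, so that escaped carriage
returns are unaffected. The name normalize_newlines is hypothetical.

```python
def normalize_newlines(raw_body):
    """Replace CR LF pairs and lone CR code points with a single LF."""
    # Handle the pairs first so that a CR LF pair does not become two LFs.
    return raw_body.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("one\r\ntwo\rthree\nfour")))
```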
Normalization can be subverted by using a combination of escapes:
.nf
CARRIAGE RETURN only:
'''one\r\
two'''
CARRIAGE RETURN and LINE FEED:
'''one\r
two'''
.in 3
Escaped newlines are not replaced with any characters (i.e. the newline is
removed). In addition, the following table describes the string escape
sequences that have direct code point replacement for all quoted string and
symbol forms.
.nf
Unicode Code Point   Ion Escape   Semantics
U+0007               \\a          BEL (alert)
U+0008               \\b          BS (backspace)
U+0009               \\t          HT (tab)
U+000A               \\n          LF (linefeed)
U+000C               \\f          FF (form feed)
U+000D               \\r          CR (carriage return)
U+000B               \\v          VT (vertical tab)
U+0022               \\"          double quote
U+0027               \\'          single quote
U+003F               \\?          question mark
U+005C               \\\\         backslash
U+002F               \\/          forward slash
U+0000               \\0          NUL (null character)
.in 3
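The table above can be expressed as a lookup from the character that
follows the backslash to its replacement code point. The Python sketch
below is illustrative only: it covers just the escapes in this table plus
the escaped newline, and it raises on anything else, including the
hexadecimal escapes. The names SIMPLE_ESCAPES and replace_simple_escapes
are hypothetical.

```python
# Character after the backslash -> replacement code point (from the table).
SIMPLE_ESCAPES = {
    "a": "\u0007", "b": "\u0008", "t": "\u0009", "n": "\u000A",
    "f": "\u000C", "r": "\u000D", "v": "\u000B", '"': "\u0022",
    "'": "\u0027", "?": "\u003F", "\\": "\u005C", "/": "\u002F",
    "0": "\u0000",
}


def replace_simple_escapes(body):
    """Replace table escapes in a string body; drop escaped newlines."""
    out = []
    chars = iter(body)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        following = next(chars, None)
        if following == "\n":
            continue  # an escaped newline is removed entirely
        if following not in SIMPLE_ESCAPES:
            raise ValueError("unsupported escape: " + repr(following))
        out.append(SIMPLE_ESCAPES[following])
    return "".join(out)
```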
For the Unicode ordinal string escapes, \U, \u, and \\x, the escape must
be followed by a fixed number of hexadecimal digits as described below.
.nf
Unicode      Ion
Code Point   Sequence     Semantics
U+HHHHHHHH   \UHHHHHHHH   8-digit hexadecimal Unicode code point
U+HHHH       \uHHHH       4-digit hexadecimal Unicode code point;
                          equivalent to \U0000HHHH
U+00HH       \\xHH        2-digit hexadecimal Unicode code point;
                          equivalent to \u00HH and \U000000HH
.in 3
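The equivalence of the three ordinal escape forms can be sketched in
Python as follows. This is illustrative only: it decodes well-formed
ordinal escapes, leaves escaped backslashes and everything else untouched,
and does not diagnose malformed escapes. It rejects values above
U+0010FFFF, and it leaves the treatment of surrogate code points open. The
name decode_ordinal_escapes is hypothetical.

```python
import re

_TOKEN = re.compile(
    r"\\\\"                      # an escaped backslash: left untouched
    r"|\\U([0-9A-Fa-f]{8})"      # 8-digit form
    r"|\\u([0-9A-Fa-f]{4})"      # 4-digit form
    r"|\\x([0-9A-Fa-f]{2})"      # 2-digit form
)


def decode_ordinal_escapes(body):
    """Replace each ordinal escape with the code point it denotes."""
    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        if digits is None:
            return match.group(0)  # escaped backslash, not an ordinal escape
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            raise ValueError("U+%08X is outside the Unicode range" % code_point)
        return chr(code_point)

    return _TOKEN.sub(replace, body)


# The three forms denote the same code point.
forms = ["\\x41", "\\u0041", "\\U00000041"]
print([decode_ordinal_escapes(form) for form in forms])  # ['A', 'A', 'A']
```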
Ion does not specify the behavior when an escape sequence denotes an
invalid Unicode code point or a surrogate code point (used only for
UTF-16). It is highly recommended that Ion implementations reject such
escape sequences, as they are not proper Unicode as specified by the
standard. To illustrate, consider the Ion string sequence "\uD800\uDC00".
A compliant parser may throw an exception because surrogate code points are
specified outside of the context of UTF-16, accept the string as a
technically invalid sequence of two Unicode code points (i.e. U+D800 and
U+DC00), or interpret it as the single Unicode code point U+00010000. In
this regard, the Ion string data type does not conform to the Unicode
specification. A strict Unicode implementation of the Ion text format
should not accept such sequences.
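The three permitted behaviors can be sketched in Python as a policy applied
to the code points produced by escape processing. This is an illustration
under stated assumptions, not a required design: the policy names
"reject", "keep", and "combine" and the function name
resolve_escaped_code_points are hypothetical.

```python
def _is_high(cp):
    return 0xD800 <= cp <= 0xDBFF


def _is_low(cp):
    return 0xDC00 <= cp <= 0xDFFF


def resolve_escaped_code_points(code_points, policy="reject"):
    """Apply one of three surrogate policies to a list of code points."""
    if policy not in ("reject", "keep", "combine"):
        raise ValueError("unknown policy: " + repr(policy))
    out = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        if not (_is_high(cp) or _is_low(cp)):
            out.append(cp)
            i += 1
        elif policy == "reject":
            # The recommended, strict behavior.
            raise ValueError("surrogate code point U+%04X" % cp)
        elif (policy == "combine" and _is_high(cp)
              and i + 1 < len(code_points) and _is_low(code_points[i + 1])):
            low = code_points[i + 1]
            out.append(0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # "keep", or an unpaired surrogate under "combine".
            out.append(cp)
            i += 1
    return out


print([hex(cp) for cp in resolve_escaped_code_points([0xD800, 0xDC00], "combine")])
# ['0x10000']
```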
.tm _ 6.2.2. Binary Format ....................................... \n%
.ti 0
6.2.2. Binary Format
The Ion binary format encodes the string data type directly as a sequence
of UTF-8 octets. A strict, Unicode-compliant implementation of Ion should
not allow invalid UTF-8 sequences (e.g. surrogate code points, overlong
values, and values outside of the inclusive range U+0000 to U+0010FFFF).
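As an illustration, Python's built-in strict UTF-8 decoder rejects each of
the invalid cases listed above. The sketch below assumes the string
payload octets have already been extracted from the binary value; the
helper name is_strict_utf8 is hypothetical.

```python
def is_strict_utf8(octets):
    """True if the octets form a valid UTF-8 sequence under strict decoding."""
    try:
        octets.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "U+1D400, valid": b"\xf0\x9d\x90\x80",
    "surrogate U+D800": b"\xed\xa0\x80",
    "overlong NUL": b"\xc0\x80",
    "above U+10FFFF": b"\xf4\x90\x80\x80",
}
for label, octets in samples.items():
    print(label, is_strict_utf8(octets))
```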
.tm 6.3 Ion Clob ...................................................... \n%
.ti 0
6.3. Ion Clob
An Ion clob type is similar to the blob type except that the denotation in
the Ion text format uses an ASCII-based string notation rather than a
base64 encoding to denote its binary value. It is important to make the
distinction that clob is a sequence of raw octets and string is a sequence
of Unicode code points.
.tm _ 6.3.1. Text Format ......................................... \n%
.ti 0
6.3.1. Text Format
See the [grammar][6] for a formal definition of the Ion Text encoding for
the clob type.
Similar to string, adjoining long string literals within an Ion clob are
concatenated automatically. Within a clob, either a single short string
literal or one or more long string literals are allowed. For example, the
following two lines of Ion text syntax are semantically equivalent.
.nf
{{ '''Hello''' '''World''' }}
{{ "HelloWorld" }}
.in 3
The rules for the quoted strings within a clob are similar to those for the
string type, with the following exceptions. Unicode newline characters in
long strings and all verbatim ASCII characters are interpreted as their
ASCII octet values. Non-printable ASCII and non-ASCII Unicode code points
are not allowed unescaped in the string bodies. Furthermore, the following
table describes the clob string escape sequences that have direct octet
replacement for all clob string literals.
.nf
Octet   Ion Escape   Semantics
0x07    \\a          ASCII BEL (alert)
0x08    \\b          ASCII BS (backspace)
0x09    \\t          ASCII HT (tab)
0x0A    \\n          ASCII LF (line feed)
0x0C    \\f          ASCII FF (form feed)
0x0D    \\r          ASCII CR (carriage return)
0x0B    \\v          ASCII VT (vertical tab)
0x22    \\"          ASCII double quote
0x27    \\'          ASCII single quote
0x3F    \\?          ASCII question mark
0x5C    \\\\         ASCII backslash
0x2F    \\/          ASCII forward slash
0x00    \\0          ASCII NUL (null character)
.in 3
The clob escape \\x must be followed by two hexadecimal digits. Note that
clob does not support the \u and \U escapes since it represents an octet
sequence and not a Unicode encoding.
.nf
Octet   Ion Escape   Semantics
0xHH    \\xHH        2-digit hexadecimal octet
.in 3
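The clob rules above can be sketched in Python as a decoder from a clob
string body to raw octets. This is a simplified illustration, not a
conforming parser. It assumes the body has already been extracted from its
quotes and that newlines have been normalized, it treats 0x20 through 0x7E
as the printable ASCII range, and it admits an unescaped line feed as it
would appear in a long string. The names CLOB_SIMPLE_ESCAPES and
decode_clob_body are hypothetical.

```python
import string

# Character after the backslash -> replacement octet (from the table above).
CLOB_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "f": 0x0C, "r": 0x0D,
    "v": 0x0B, '"': 0x22, "'": 0x27, "?": 0x3F, "\\": 0x5C, "/": 0x2F,
    "0": 0x00,
}


def decode_clob_body(body):
    """Decode a clob string body into the octets it denotes."""
    out = bytearray()
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch != "\\":
            if not (0x20 <= ord(ch) <= 0x7E or ch == "\n"):
                raise ValueError("code point U+%04X must be escaped" % ord(ch))
            out.append(ord(ch))
            pos += 1
            continue
        escape = body[pos + 1:pos + 2]
        if escape == "x":
            digits = body[pos + 2:pos + 4]
            if len(digits) != 2 or any(d not in string.hexdigits for d in digits):
                raise ValueError("the x escape needs two hexadecimal digits")
            out.append(int(digits, 16))
            pos += 4
        elif escape == "\n":
            pos += 2  # an escaped newline contributes no octet
        elif escape in CLOB_SIMPLE_ESCAPES:
            out.append(CLOB_SIMPLE_ESCAPES[escape])
            pos += 2
        else:
            # Includes the u and U escapes, which clob does not support.
            raise ValueError("unsupported clob escape: " + repr(escape))
    return bytes(out)
```

Because the result is a bytes value, octets above 0x7F written with the
2-digit hexadecimal escape pass through unchanged and are never
interpreted as Unicode code points.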
It is important to note that clob is a binary type that is designed for
binary values that are either text encoded in a code page that is ASCII
compatible or should be octet editable by a human (escaped string syntax
vs. base64 encoded data). Clearly, non-ASCII-based encodings will not be
very readable (e.g. the clob for the EBCDIC encoded string representing
"hello" could be denoted as {{ "\\xc7\\xc1%%?" }}).
.tm _ 6.3.2. Binary Format ....................................... \n%
.ti 0
6.3.2. Binary Format
The Ion binary format represents a clob directly as the sequence of octet
values that make up the clob value.