<?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom"><title>Pat Shaughnessy</title><id>http://patshaughnessy.net</id><updated>2022-02-19T18:22:06Z</updated><author><name>Pat Shaughnessy</name></author><entry><title>LLVM IR: The Esperanto of Computer Languages</title><link href="https://patshaughnessy.net/2022/2/19/llvm-ir-the-esperanto-of-computer-languages" rel="alternate"></link><id href="https://patshaughnessy.net/2022/2/19/llvm-ir-the-esperanto-of-computer-languages" rel="alternate"></id><published>2022-02-19T00:00:00Z</published><updated>2022-02-19T00:00:00Z</updated><category>Crystal</category><author><name>Pat Shaughnessy</name></author><content type="html"><div style="float: left; padding: 8px 30px 0px 0px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2022/2/19/esperanto.png"><br/>
<i> Esperanto grammar is logical and self-<br/>
consistent, designed to be easy to learn. <br/>
<small> <a title="Renatoeo, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:CARD_GRAM%C3%81TICA_ESPERANTO.png">via Wikimedia Commons</a></small> </i>
</div>
<p>I empathize with people who have to learn English as a foreign language. English
grammar is inconsistent, arbitrary and hard to master. English spelling is even
worse. I sometimes find myself apologizing for my language’s shortcomings. But
learning any foreign language as an adult is very difficult.</p>
<p><a href="https://en.wikipedia.org/wiki/Esperanto">Esperanto</a>, an “artificial language,”
is different. Created by Ludwik Zamenhof and first published in 1887, Esperanto has a vocabulary
and grammar that are logical and consistent, designed to be easier to learn.
Zamenhof intended Esperanto to become the universal second language.</p>
<p>Computers have to learn foreign languages too. Every time you compile and run
a program, your compiler translates your code into a foreign language: the
native machine language that runs on your target platform. Compilers should
have been called translators. And compilers struggle with the same things we
do: inconsistent grammar and vocabulary, and other peculiarities of the target
platform.</p>
<p>Recently, however, more and more compilers translate your code to an
artificial machine language. They produce a simpler, more consistent, more
powerful machine language that doesn’t actually run on any machine. This
artificial machine language, LLVM IR, makes writing compilers simpler and
reading the code compilers produce simpler too.</p>
<p>LLVM IR is becoming the universal second language for compilers.</p>
<h2>One Line of LLVM IR</h2>
<p>The <a href="https://llvm.org">Low Level Virtual Machine</a> (LLVM) project had the novel
idea of inventing a virtual machine that was easy for compiler engineers to use
as a target platform. The LLVM team designed a special instruction set called
<a href="https://llvm.org/docs/LangRef.html">intermediate representation</a> (IR). Compilers
for modern languages such as Rust and Swift, Clang for C, and many others
first translate your code to LLVM IR. Then they use the LLVM framework
to convert the IR into actual machine language for any target platform LLVM
supports:</p>
<img style="width: 500px; margin-bottom: 20px" src="https://patshaughnessy.net/assets/2022/2/19/platforms.svg">
<p>LLVM is great for compilers. Compiler engineers don’t have to worry about the
detailed instruction set of each platform, and LLVM optimizes your code for
whatever platform you choose automatically. And LLVM is also great for people
like me who are interested in what machine language instructions look like and
how CPUs execute them. LLVM instructions are much easier to follow than real
machine instructions. Let’s take a look at one!</p>
<p>Here’s a line of LLVM IR I generated from a simple
<a href="https://crystal-lang.org">Crystal</a> program:</p>
<pre type="console">%57 = call %"Array(Int32)"* @"*Array(Int32)@Array(T)::unsafe_build&lt;Int32>:Array(Int32)"(i32 610, i32 2), !dbg !89</pre>
<p>Wait a minute! This isn’t simple or easy to follow at all! What am I talking
about here? At first glance, this does look confusing. But as we’ll see, most
of the confusing syntax is related to Crystal, not LLVM. Studying this line of
code will reveal more about Crystal than it will about LLVM.</p>
<p>The rest of this article will unpack and explain what this line of code means.
It looks complex, but is actually quite simple.</p>
<h2>The Call Instruction</h2>
<p>The instruction above is a function call in LLVM IR. To produce this code, I
wrote a small Crystal program and then translated it using this command:</p>
<pre type="console">$ crystal build array_example.cr --emit llvm-ir</pre>
<p>The <code>--emit</code> option directed Crystal to generate a file called array_example.ll,
which contains the line above along with thousands of other lines. We’ll get to
the Crystal code in a minute. But for now, how do I get started understanding
what the LLVM code means?</p>
<p>The <a href="https://llvm.org/docs/LangRef.html">LLVM Language Reference Manual</a> has
documentation for <code>call</code> and all of the other LLVM IR instructions. Here’s the
syntax for <code>call</code>:</p>
<pre type="console">&lt;result> = [tail | musttail | notail ] call [fast-math flags] [cconv] [ret attrs] [addrspace(&lt;num>)]
&lt;ty>|&lt;fnty> &lt;fnptrval>(&lt;function args>) [fn attrs] [ operand bundles ]</pre>
<p>My example <code>call</code> instruction doesn’t use many of these options. Removing the
unused options, I can see the actual, basic syntax of <code>call</code>:</p>
<pre type="console">&lt;result> = call &lt;ty> &lt;fnptrval>(&lt;function args>)</pre>
<p>In order from left to right, these values are:</p>
<ul>
<li>
<p><code>&lt;result&gt;</code>: the register that receives the result</p>
</li>
<li>
<p><code>&lt;ty&gt;</code>: the type of the return value</p>
</li>
<li>
<p><code>&lt;fnptrval&gt;</code>: a pointer to the function to call</p>
</li>
<li>
<p><code>&lt;function args&gt;</code>: the arguments to pass to that function</p>
</li>
</ul>
<p>What does all of this mean, exactly? Let’s find out!</p>
<h2>A CPU With Infinite Registers</h2>
<p>Starting on the left and moving right, let’s step through the <code>call</code> instruction:</p>
<img src="https://patshaughnessy.net/assets/2022/2/19/result.svg">
<p>The token <code>%57</code> to the left of the equals sign tells LLVM where to save the
return value of the function call that follows. This isn’t a normal variable;
<code>%57</code> is an LLVM “register.”</p>
<p>Registers are physical circuits located on microprocessor chips used to save
intermediate values. Saving a value in a CPU register is much faster than
saving a value in memory, since the register is located on the same chip as the
rest of the microprocessor. Saving a value in RAM, on the other hand,
requires transmitting that value from one chip to another and is much slower,
relatively speaking. Unfortunately, each CPU has a limited number of registers
available, and so compilers have to decide which values are used frequently
enough to warrant saving in nearby registers, and which other values can be
moved out to more distant memory.</p>
<p>Unlike the limited number of registers available on a real CPU, the imaginary
LLVM microprocessor has an infinite number of them. Because of this, compilers
that target LLVM can simply save values to a register whenever they would like.
There’s no need to find an available register, or to move an existing value out
of a register first before using it for something else. Busy work that normal
machine language code can’t avoid.</p>
<p>In this program, the Crystal compiler had already saved 56 other values in
“registers” and so for this line of LLVM IR, Crystal simply used the next
register, number 57.</p>
<h2>LLVM Structure Types</h2>
<p>Moving left to right, LLVM <code>call</code> instructions next indicate the type of the
function call’s return value:</p>
<img src="https://patshaughnessy.net/assets/2022/2/19/type.svg">
<p>The name of this type, <code>Array(Int32)</code>, is generated by the Crystal compiler, not
by LLVM. That is, this is a type from my Crystal program. It could have been
anything, and indeed other compilers that target LLVM will generate completely
different type names.</p>
<p>The example Crystal program I used to generate this LLVM code was:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]
</span><span style="color:#000000;">puts arr[</span><span style="color:#d08770;">1</span><span style="color:#000000;">]</span></pre>
<p>When I compiled this program, Crystal generated the <code>call</code> instruction above,
which returns a pointer to the new array, <code>arr</code>. Since <code>arr</code> is an array
containing integers, Crystal uses the generic type <code>Array(Int32)</code>.</p>
<p>Machine languages that target real machines only support the types that the
hardware supports. For example, Intel x86 assembly language allows you to save
integers of different widths, 16, 32 or 64 bits for example, and an Intel x86
CPU has registers designed to hold values of each of these sizes.</p>
<p>LLVM IR is more powerful. It supports “structure types,” similar to a C
structure or an object in a language like Crystal or Swift. Here the <code>%&quot;…&quot;</code>
syntax indicates the name inside the quotes is the name of a structure type.
And the asterisk which follows, like in C, indicates the type of the return
value of my function call is a pointer to this structure.</p>
<p>My example LLVM program defines the type <code>Array(Int32)</code> like this:</p>
<pre type="console">%"Array(Int32)" = type { i32, i32, i32, i32, i32* }</pre>
<p>Structure types allow LLVM IR programs to create pointers to structures or
objects, and to access any of the values inside each object. That makes writing
a compiler much easier. In my example, the call instruction returns a pointer
to an object which contains four 32-bit integer values, followed by a pointer to
other 32-bit integer values. But what are all of these integer values? Above I said
this function call was returning a new array - how can that be the case?</p>
<p>LLVM itself has no idea, and no opinion on the matter. To understand what these
values are, and what they have to do with the array in my program, we need to
learn more about the Crystal compiler that generated this LLVM IR code.</p>
<p>Reading the <a href="https://github.com/crystal-lang/crystal/blob/master/src/array.cr#L48">Crystal standard
library</a>,
we can see Crystal implements arrays like this:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a71d5d;">class </span><span style="color:#008080;">Array</span><span style="color:#000000;">(T)
</span><span style="color:#795da3;">include </span><span style="color:#008080;">Indexable</span><span style="color:#000000;">::Mutable(T)
</span><span style="color:#795da3;">include </span><span style="color:#000000;">Comparable(Array)
</span><span style="color:#000000;">
</span><span style="color:#a7adba;"># Size of an Array that we consider small to do linear scans or other optimizations.
</span><span style="color:#795da3;">private </span><span style="color:#000000;">SMALL_ARRAY_SIZE </span><span style="color:#4f5b66;">= </span><span style="color:#d08770;">16
</span><span style="color:#000000;">
</span><span style="color:#a7adba;"># The size of this array.
</span><span style="color:#4f5b66;">@</span><span style="color:#000000;">size </span><span style="color:#4f5b66;">: </span><span style="color:#000000;">Int32
</span><span style="color:#000000;">
</span><span style="color:#a7adba;"># The capacity of `@buffer`.
</span><span style="color:#a7adba;"># Note that, because `@buffer` moves on shift, the actual
</span><span style="color:#a7adba;"># capacity (the allocated memory) starts at `@buffer - @offset_to_buffer`.
</span><span style="color:#a7adba;"># The actual capacity is also given by the `remaining_capacity` internal method.
</span><span style="color:#4f5b66;">@</span><span style="color:#000000;">capacity </span><span style="color:#4f5b66;">: </span><span style="color:#000000;">Int32
</span><span style="color:#000000;">
</span><span style="color:#a7adba;"># Offset to the buffer that was originally allocated, and which needs to
</span><span style="color:#a7adba;"># be reallocated on resize. On shift this value gets increased, together with
</span><span style="color:#a7adba;"># `@buffer`. To reach the root buffer you have to do `@buffer - @offset_to_buffer`,
</span><span style="color:#a7adba;"># and this is also provided by the `root_buffer` internal method.
</span><span style="color:#4f5b66;">@</span><span style="color:#000000;">offset_to_buffer </span><span style="color:#4f5b66;">: </span><span style="color:#000000;">Int32 </span><span style="color:#4f5b66;">= </span><span style="color:#d08770;">0
</span><span style="color:#000000;">
</span><span style="color:#a7adba;"># The buffer where elements start.
</span><span style="color:#4f5b66;">@</span><span style="color:#000000;">buffer </span><span style="color:#4f5b66;">: </span><span style="color:#000000;">Pointer(T)
</span><span style="color:#000000;">
</span><span style="color:#a7adba;"># In 64 bits the Array is composed then by:
</span><span style="color:#a7adba;"># - type_id : Int32 # 4 bytes -|
</span><span style="color:#a7adba;"># - size : Int32 # 4 bytes |- packed as 8 bytes
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># - capacity : Int32 # 4 bytes -|
</span><span style="color:#a7adba;"># - offset_to_buffer : Int32 # 4 bytes |- packed as 8 bytes
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># - buffer : Pointer # 8 bytes |- another 8 bytes</span></pre>
<p>The comments above are very illustrative and complete - the Crystal team took
the time to document their standard library and explain not only how to use
each class, like <code>Array(T)</code>, but also how it is implemented internally.</p>
<p>In this case, we can see the four <code>i32</code> values inside the <code>Array(Int32)</code> LLVM
structure type hold the size and capacity of the array, among other things.
And the <code>i32*</code> value is a pointer to the actual contents of the array.</p>
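<p>A minimal Crystal sketch, assuming a 64-bit target, that looks at the same fields from the language side. It only uses the public <code>size</code> and <code>to_unsafe</code> methods and the <code>instance_sizeof</code> primitive; the variable name is the one from my example program:</p>
<pre style="background-color:#ffffff;">
arr = [12345, 67890]

# @size: the number of elements currently in the array.
puts arr.size

# @buffer: to_unsafe returns the Pointer(Int32) that the i32* field holds,
# so indexing it reads element 1 straight from the element buffer.
puts arr.to_unsafe[1]

# Total bytes in one Array(Int32) object. Four i32 fields plus one pointer
# should come to 24 on a 64-bit target, matching the layout comment above.
puts instance_sizeof(Array(Int32))</pre>
<p>If the structure type and the standard library comments agree, the last number should be the sum of the five fields and nothing more.</p>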
<h2>Functions</h2>
<p>The target of the call instruction appears next, after the return type:</p>
<img src="https://patshaughnessy.net/assets/2022/2/19/function.svg">
<p>This is quite a mouthful! What sort of function is this?</p>
<p>There are two steps to understanding this: First, the <code>@&quot;…&quot;</code> syntax. This is
simply a global identifier in this LLVM program. So my <code>call</code> instruction is just
calling a global function. In LLVM programs, all functions are global; there is
no concept of a class, module or similar groupings of code.</p>
<p>But what in the world does that crazy identifier mean?</p>
<p>LLVM ignores this complex name. For LLVM this is just a name like <code>foo</code> or <code>bar</code>.
But for Crystal, the name has much more significance. Crystal encoded a lot of
information into this one name. Crystal can do this because the LLVM code isn’t
intended for anyone to read directly. Crystal has created a “mangled name,”
meaning the original version of the function to call is there but it’s been
mangled or rewritten in a confusing manner.</p>
<p>Crystal rewrites function names to ensure they are unique. In Crystal, like in
many other statically typed languages, functions with different argument types
or return value types are actually different functions. So in Crystal if I
write:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a71d5d;">def </span><span style="color:#795da3;">foo</span><span style="color:#000000;">(a : Int32)
</span><span style="color:#000000;">puts </span><span style="color:#4f5b66;">&quot;</span><span style="color:#008080;">Int: </span><span style="color:#000000;">#{a}</span><span style="color:#4f5b66;">&quot;
</span><span style="color:#a71d5d;">end
</span><span style="color:#000000;">
</span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">foo</span><span style="color:#000000;">(a : String)
</span><span style="color:#000000;">puts </span><span style="color:#4f5b66;">&quot;</span><span style="color:#008080;">String: </span><span style="color:#000000;">#{a}</span><span style="color:#4f5b66;">&quot;
</span><span style="color:#a71d5d;">end
</span><span style="color:#000000;">
</span><span style="color:#000000;">foo(</span><span style="color:#d08770;">123</span><span style="color:#000000;">)
</span><span style="color:#a7adba;">#=&gt; Int: 123
</span><span style="color:#000000;">foo(</span><span style="color:#4f5b66;">&quot;</span><span style="color:#008080;">123</span><span style="color:#4f5b66;">&quot;</span><span style="color:#000000;">)
</span><span style="color:#a7adba;">#=&gt; String: 123</span></pre>
<p>…I have two separate, different functions both called <code>foo</code>. The type of the
parameter <code>a</code> distinguishes one from the other.</p>
<p>Crystal generates unique function names by encoding the arguments, return value
and type of the receiver into the function name string, making it
quite complex. Let’s break it down:</p>
<img src="https://patshaughnessy.net/assets/2022/2/19/mangled.svg">
<ul>
<li>
<p><code>Array(Int32)@Array(T)</code> - this is the type of the receiver. That means the
<code>unsafe_build</code> function is actually a method on the <code>Array(T)</code> generic class.
And in this case, the receiver is an array holding 32-bit integers, the
<code>Array(Int32)</code> class. Crystal includes both names in the mangled function name.</p>
</li>
<li>
<p><code>unsafe_build</code> - this is the function Crystal is calling.</p>
</li>
<li>
<p><code>Int32</code> - these are the function’s parameter types. In this case, Crystal is
passing in a single integer, so we just see one <code>Int32</code> type.</p>
</li>
<li>
<p><code>Array(Int32)</code> - this is the return value type, a new array containing integers.</p>
</li>
</ul>
<p>As I discussed in <a href="https://patshaughnessy.net/2022/1/22/visiting-an-abstract-syntax-tree">my last
post</a>,
the Crystal compiler internally rewrites my array literal expression <code>[12345, 67890]</code> into code that creates and initializes a new array object:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">__temp_621 </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">::Array(typeof(</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">)).unsafe_build(</span><span style="color:#d08770;">2</span><span style="color:#000000;">)
</span><span style="color:#000000;">__temp_622 </span><span style="color:#4f5b66;">=</span><span style="color:#000000;"> __temp_621.to_unsafe
</span><span style="color:#000000;">__temp_622[</span><span style="color:#d08770;">0</span><span style="color:#000000;">] </span><span style="color:#4f5b66;">= </span><span style="color:#d08770;">12345
</span><span style="color:#000000;">__temp_622[</span><span style="color:#d08770;">1</span><span style="color:#000000;">] </span><span style="color:#4f5b66;">= </span><span style="color:#d08770;">67890
</span><span style="color:#000000;">__temp_621</span></pre>
<p>In this expanded code, Crystal calls <code>unsafe_build</code> and passes in <code>2</code>, the
required capacity of the new array. And to distinguish this use of
<code>unsafe_build</code> from other <code>unsafe_build</code> functions that might exist in my
program, the compiler generated the mangled name we see above. </p>
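<p>The expansion can also be written by hand. This is a sketch only: <code>unsafe_build</code> appears to be a helper meant for the compiler's own use, so ordinary code should not rely on it, and the variable names below are mine rather than the compiler's temporaries:</p>
<pre style="background-color:#ffffff;">
# Hand-written version of what the compiler generates for [12345, 67890].
arr = Array(Int32).unsafe_build(2) # allocate room for two Int32 values
buf = arr.to_unsafe                # raw Pointer(Int32) to the element buffer
buf[0] = 12345
buf[1] = 67890

puts arr[1]</pre>
<p>This should print 67890, the same result as the original two-line program.</p>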
<h2>Arguments</h2>
<p>Finally, after the function name the LLVM IR instruction shows the arguments
for the function call:</p>
<img src="https://patshaughnessy.net/assets/2022/2/19/args.svg">
<p>LLVM IR uses parentheses, like most languages, to enclose the arguments to a
function call. And the types precede each value: <code>610</code> is a 32-bit integer and
<code>2</code> is also a 32-bit integer.</p>
<p>But wait a minute! We saw just above the expanded Crystal code for generating
the array literal passes a single value, <code>2</code>, into the call to <code>unsafe_build</code>.
And looking at the mangled function name above, we also see there is a single
<code>Int32</code> parameter to the function call.</p>
<p>But reading the LLVM IR code we can see a second value is also passed in:
<code>610</code>. What in the world does <code>610</code> mean? I don’t have 610 elements in my new
array, and 610 is not one of the array elements. So what is going on here?</p>
<p>Crystal is an object oriented language, meaning that each function is
optionally associated with a class. In OOP parlance, we say that we are
“sending a message” to a “receiver.” In this case, <code>unsafe_build</code> is the message,
and <code>::Array(typeof(12345, 67890))</code> is the receiver. In fact, this function is
really a class method. We are calling <code>unsafe_build</code> on the <code>Array(Int32)</code> class,
not on an instance of one array.</p>
<p>Regardless, LLVM IR doesn’t support classes or instance methods or class
methods. In LLVM IR, we only have simple, global functions. And indeed, the
LLVM virtual machine doesn’t care what these arguments are or what they mean.
LLVM doesn’t encode the meaning or purpose of each argument; it just does what
the Crystal compiler tells it to do.</p>
<p>But Crystal, on the other hand, has to implement object oriented behavior
somehow. Specifically, the <code>unsafe_build</code> function needs to behave differently
depending on which class it was called for, depending on what the receiver is.
For example:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">::Array(typeof(</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">)).unsafe_build(</span><span style="color:#d08770;">2</span><span style="color:#000000;">)</span></pre>
<p>… has to return an array of two integers. While:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">::Array(typeof(</span><span style="color:#4f5b66;">&quot;</span><span style="color:#008080;">abc</span><span style="color:#4f5b66;">&quot;</span><span style="color:#000000;">, </span><span style="color:#4f5b66;">&quot;</span><span style="color:#008080;">def</span><span style="color:#4f5b66;">&quot;</span><span style="color:#000000;">)).unsafe_build(</span><span style="color:#d08770;">2</span><span style="color:#000000;">)</span></pre>
<p>…has to return an array of two strings. How does this work in the LLVM IR code?</p>
<p>To implement object oriented behavior, Crystal passes the receiver as a hidden,
special argument to the function call:</p>
<img src="https://patshaughnessy.net/assets/2022/2/19/args2.svg">
<p>This receiver argument is a reference or pointer to the receiver’s object, and
is normally known as <code>self</code>. Here <code>610</code> is a reference or tag corresponding to
the <code>Array(Int32)</code> class, the receiver. And <code>2</code> is the actual argument to the
<code>unsafe_build</code> method.</p>
<p>Reading the LLVM IR code, we’ve learned that Crystal secretly passes a hidden
<code>self</code> argument to every method call to an object. Then inside each method, the
code has access to <code>self</code>, to the object instance that code is running for. Some
languages, like Rust, require us to declare <code>self</code> explicitly in each method definition;
in Crystal this behavior is automatic and hidden.</p>
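<p>One way to picture the hidden argument is to write the same operation twice in Crystal: once as a method, and once as a plain top-level function that receives the object explicitly. This is only a model of the idea. The <code>Counter</code> class and the <code>counter_add</code> function are hypothetical names, and the compiler's real output is LLVM IR, not Crystal:</p>
<pre style="background-color:#ffffff;">
class Counter
  property total : Int32 = 0

  # Method form: self is implicit, so @total means "the total of self".
  def add(amount : Int32)
    @total += amount
  end
end

# Flat form: the receiver is an ordinary first parameter, which is roughly
# the shape of the global function that LLVM sees.
def counter_add(receiver : Counter, amount : Int32)
  receiver.total += amount
end

c = Counter.new
c.add(2)          # receiver passed implicitly
counter_add(c, 3) # receiver passed explicitly
puts c.total      # prints 5</pre>
<p>Both calls update the same object; the only difference is whether the receiver is visible in the argument list.</p>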
<h2>Learning How Compilers Work</h2>
<p>LLVM IR is a simple language designed for compiler engineers. I think of it
like a blank slate for them to write on. Most LLVM instructions are quite
simple and easy to understand; as we saw above, understanding the basic syntax
of the call instruction wasn’t hard at all.</p>
<p>The hard part was understanding how the Crystal compiler, which targets LLVM
IR, generates code. The LLVM syntax itself was easy to follow; it was the
Crystal language’s implementation that was harder to understand.</p>
<p>And this is the real reason to learn about LLVM IR syntax. If you take the time
to learn how LLVM instructions work, then you can start to read the code your
favorite language’s compiler generates. And once you can do that, you can learn
more about how your favorite compiler works, and what your programs actually do
when you run them.</p>
</content></entry><entry><title>Visiting an Abstract Syntax Tree</title><link href="https://patshaughnessy.net/2022/1/22/visiting-an-abstract-syntax-tree" rel="alternate"></link><id href="https://patshaughnessy.net/2022/1/22/visiting-an-abstract-syntax-tree" rel="alternate"></id><published>2022-01-22T00:00:00Z</published><updated>2022-01-22T00:00:00Z</updated><category>Crystal</category><author><name>Pat Shaughnessy</name></author><content type="html"><div style="float: left; padding: 8px 30px 0px 0px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2022/1/22/visit-tree.jpg"><br/>
<i>Joshua Tree National Park
<small>(via: <a href="https://commons.wikimedia.org/wiki/File:Backpacker_at_Sunset_(22849298523).jpg">Wikimedia Commons</a>)</small>
</i>
</div>
<p>In my <a href="https://patshaughnessy.net/2021/12/22/reading-code-like-a-compiler">last
post</a>, I
explored how <a href="https://crystal-lang.org">Crystal</a> parsed a simple program and
produced a data structure called an <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">abstract syntax
tree</a> (AST). But what does
Crystal do with the AST? Why bother going to such lengths to create it?</p>
<p>After Crystal parses my code, it repeatedly steps through all the entries or
nodes in the AST and builds up a description of the intended meaning and
behavior of my code. This process is known as <em>semantic analysis</em>. Later,
Crystal will use this description to convert my program into a machine language
executable.</p>
<p>But what does this description contain? What does it really mean for a compiler
to <em>understand</em> anything? Let’s pack our bags and visit an abstract syntax tree
with Crystal to find out.</p>
<div style="clear: both"></div>
<h2>The Visitor Pattern</h2>
<p>Imagine several tourists visiting a famous tree: Each of them sees the same
tree in a different way. The tree doesn’t change, but the perspective of each
person looking at it is different. They each take a different photo, or
remember different details.</p>
<p>In Computer Science this separation of the data structure (the tree) from the
algorithms using it (the tourists) is known as the <a href="https://en.wikipedia.org/wiki/Visitor_pattern">visitor
pattern</a>. This technique allows
compilers and other programs to run multiple algorithms on the same data
structure without making a mess.</p>
<p>The visitor pattern calls for two functions: <code>accept</code> and <code>visit</code>. First, a
node in the data structure “accepts” a visitor:</p>
<img class="svg" width="400px" src="https://patshaughnessy.net/assets/2022/1/22/visitor1.svg">
<p>After accepting a visitor, the node turns around and calls the <code>visit</code> method on
<code>Visitor</code>:</p>
<img class="svg" width="400px" src="https://patshaughnessy.net/assets/2022/1/22/visitor2.svg">
<p>The <code>visit</code> method implements whatever algorithm that visitor is interested in.</p>
<p>This seems kind of pointless… why use <code>accept</code> at all? We could just call
<code>visit</code> directly. The key is that, after calling the visitor and passing
itself, the node passes the visitor to each of its children, recursively:</p>
<img class="svg" width="400px" src="https://patshaughnessy.net/assets/2022/1/22/visitor3.svg">
<p>And then the visitor can visit each of the child nodes also. The <code>Visitor</code>
class doesn’t necessarily need to know anything about how to navigate the node
data structure. And more and more visitor classes can implement new algorithms
without changing the underlying data structure or breaking each other.</p>
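<p>Here’s a minimal sketch of the idea in Crystal. The <code>Node</code>, <code>Visitor</code>,
<code>CountingVisitor</code> and <code>NameVisitor</code> classes are hypothetical names I made up for
illustration; they are not the compiler’s own classes. Two different visitors walk the
same tree, and neither one knows how to navigate it:</p>

```crystal
# Hypothetical classes for illustration only; not part of the Crystal compiler.
class Node
  getter name : String
  getter children : Array(Node)

  def initialize(@name : String, @children = [] of Node)
  end

  # The node hands itself to the visitor, then passes the visitor
  # on to each of its children, recursively.
  def accept(visitor : Visitor)
    visitor.visit(self)
    children.each { |child| child.accept(visitor) }
  end
end

abstract class Visitor
  abstract def visit(node : Node)
end

# One "tourist": counts the nodes it sees.
class CountingVisitor < Visitor
  getter count = 0

  def visit(node : Node)
    @count += 1
  end
end

# Another "tourist": remembers the node names in the order visited.
class NameVisitor < Visitor
  getter names = [] of String

  def visit(node : Node)
    @names << node.name
  end
end

tree = Node.new("Assign", [Node.new("Var"), Node.new("ArrayLiteral")])

counter = CountingVisitor.new
tree.accept(counter)
puts counter.count # 3

names = NameVisitor.new
tree.accept(names)
puts names.names.join(", ") # Assign, Var, ArrayLiteral
```

<p>Adding a third algorithm would mean writing a third visitor class; <code>Node</code> itself
would not need to change.</p>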
<h2>The Visitor Pattern in the Crystal Compiler</h2>
<p>In order to understand what my code means, Crystal reads through my program’s
AST over and over again using different visitors. Each algorithm looks for
certain syntax, records information about the types and objects my code uses or
possibly even transforms my code into a different form.</p>
<div style="float: right; padding: 8px 0px 0px 30px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2022/1/22/angel-oak.jpg"><br/>
<i>A photo I took in 2018 of <a href="https://en.wikipedia.org/wiki/Angel_Oak">Angel Oak</a>,<br/> a 400-500 year old tree in South Carolina.</i>
</div>
<p>Crystal implements the basics of the visitor pattern in
<a href="https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/syntax/visitor.cr#L24">visitor.cr</a>,
inside the superclass of all AST nodes:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a71d5d;">class </span><span style="color:#008080;">ASTNode
</span><span style="color:#343d46;"> </span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">accept</span><span style="color:#000000;">(visitor)
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">if</span><span style="color:#000000;"> visitor.visit_any self
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">if</span><span style="color:#000000;"> visitor.visit self
</span><span style="color:#000000;"> accept_children visitor
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">end
</span><span style="color:#000000;"> visitor.end_visit self
</span><span style="color:#000000;"> visitor.end_visit_any self
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">end
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">end
</span><span style="color:#a71d5d;">end</span></pre>
<p>Each subclass of <code>ASTNode</code> implements its own version of <code>accept_children</code>.</p>
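<p>To make that concrete, here’s a simplified sketch of how subclasses might implement
<code>accept_children</code>. These classes are stripped-down stand-ins I wrote for this post, not
the real definitions from the compiler, and I’ve left out <code>visit_any</code> and
<code>end_visit_any</code>. <code>TraceVisitor</code> is a hypothetical visitor that just logs what it sees:</p>

```crystal
# Simplified, hypothetical stand-ins for the compiler's AST classes.
abstract class ASTNode
  def accept(visitor)
    # Descend into the children only if visit returns a truthy value.
    if visitor.visit(self)
      accept_children(visitor)
    end
    visitor.end_visit(self)
  end

  abstract def accept_children(visitor)
end

class NumberLiteral < ASTNode
  getter value : String

  def initialize(@value : String)
  end

  # A leaf node has no children to pass the visitor to.
  def accept_children(visitor)
  end
end

class ArrayLiteral < ASTNode
  getter elements : Array(ASTNode)

  def initialize(@elements : Array(ASTNode))
  end

  # An array literal passes the visitor to each of its elements.
  def accept_children(visitor)
    elements.each &.accept(visitor)
  end
end

# A hypothetical visitor that logs every call it receives.
class TraceVisitor
  getter log = [] of String

  def visit(node : ASTNode)
    @log << "visit #{node.class.name}"
    true # returning true means "also visit my children"
  end

  def end_visit(node : ASTNode)
    @log << "end_visit #{node.class.name}"
  end
end

literal = ArrayLiteral.new([NumberLiteral.new("12345"), NumberLiteral.new("67890")] of ASTNode)
tracer = TraceVisitor.new
literal.accept(tracer)
puts tracer.log.join("\n")
```

<p>The log shows <code>visit ArrayLiteral</code> first and <code>end_visit ArrayLiteral</code> last, with
the two <code>NumberLiteral</code> visits nested in between.</p>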
<h2>My Tiny Crystal Program</h2>
<p>To get a sense of how the visitor pattern works inside of Crystal, let’s look
at one line of code from my last post:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]</span></pre>
<p>As I explained last month, the Crystal parser generates this AST tree fragment:</p>
<img class="svg" width="400px" src="https://patshaughnessy.net/assets/2022/1/22/ast1.svg">
<p>Once the parser is finished and has created this small tree, the Crystal
compiler steps through it a number of different times, looking for classes,
variables, type declarations, etc. Each of these passes through the AST is
performed by a different visitor class: <code>TopLevelVisitor</code>,
<code>InstanceVarsInitializerVisitor</code> or <code>ClassVarsInitializerVisitor</code> among many
others.</p>
<p>The most important visitor class the Crystal compiler uses is called simply
<code>MainVisitor</code>. You can find the code for <code>MainVisitor</code> in
<a href="https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/semantic/main_visitor.cr#L26">main_visitor.cr</a>:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a7adba;"># This is the main visitor of the program, ran after types have been declared
</span><span style="color:#a7adba;"># and their type declarations (like `@x : Int32`) have been processed.
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># This visits the &quot;main&quot; code of the program and resolves calls, instantiates
</span><span style="color:#a7adba;"># methods and visits them, recursively, with other MainVisitors.
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># The visitor keeps track of a method&#39;s variables (or the main program, split into
</span><span style="color:#a7adba;"># several files, in case of top-level code). It keeps track both of the type of a
</span><span style="color:#a7adba;"># variable at a single point (stored in @vars) and the combined type of all assignments
</span><span style="color:#a7adba;"># to it (in @meta_vars).
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># Call resolution logic is in `Call#recalculate`, where method lookup is done.
</span><span style="color:#a71d5d;">class </span><span style="color:#008080;">MainVisitor </span><span style="color:#343d46;">&lt; </span><span style="color:#008080;">SemanticVisitor</span></pre>
<p>Since Crystal supports typed parameters and method overloading, the visitor
class implements a different <code>visit</code> method for each type of node that it visits,
for example:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a71d5d;">class </span><span style="color:#008080;">MainVisitor </span><span style="color:#343d46;">&lt; </span><span style="color:#008080;">SemanticVisitor
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">visit</span><span style="color:#000000;">(node : Assign)
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">visit</span><span style="color:#000000;">(node : Var)
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">visit</span><span style="color:#000000;">(node : ArrayLiteral)
</span><span style="color:#000000;"> </span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">visit</span><span style="color:#000000;">(node : NumberLiteral)
</span><span style="color:#000000;">
</span><span style="color:#000000;">Etc…</span></pre>
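<p>Here’s a small, runnable sketch of that overloading technique. The empty node classes and
<code>DescribingVisitor</code> are hypothetical; the point is only that Crystal picks the matching
<code>visit</code> overload based on the type of the argument:</p>

```crystal
# Hypothetical node classes; only the overloading technique matters here.
class Var
end

class NumberLiteral
end

class ArrayLiteral
end

class DescribingVisitor
  def visit(node : Var)
    "a variable"
  end

  def visit(node : NumberLiteral)
    "a number literal"
  end

  def visit(node : ArrayLiteral)
    "an array literal"
  end
end

visitor = DescribingVisitor.new

# The array's element type is a union of the three classes, so Crystal
# chooses the right overload for each element when the program runs.
nodes = [Var.new, ArrayLiteral.new, NumberLiteral.new]
nodes.each do |node|
  puts visitor.visit(node)
end
```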
<p>Now let’s look at three examples of what the <code>MainVisitor</code> class does with my
code: identifying variables, assigning types and expanding array literals. The
Crystal compiler is much too complex to describe in a single blog post, but
hopefully I can give you a glimpse into the sort of work Crystal does during
semantic analysis.</p>
<h2>Identifying Variables</h2>
<p>Obviously, my example code creates and initializes a variable called <code>arr</code>:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]</span></pre>
<p>But how does Crystal identify this variable’s name and value? What does it do
with <code>arr</code>?</p>
<p>The <code>MainVisitor</code> class starts to process my code by visiting the root node of
this branch of my AST, the <code>Assign</code> node:</p>
<img class="svg" width="375px" src="https://patshaughnessy.net/assets/2022/1/22/visit-assign1.svg">
<p>As you can see, earlier during the parsing phase Crystal had saved the target
variable and value of this assign statement in the child AST nodes. The target
variable, <code>arr</code>, appears in the <code>Var</code> node, and the value to assign is an
<code>ArrayLiteral</code> node. The <code>MainVisitor</code> now knows I declared a new variable, called
<code>arr</code>, in the current lexical scope. Since my program has no classes, methods or
any other lexical scopes, Crystal saves this variable in a table of variables
for the top level program:</p>
<img class="svg" width="300px" src="https://patshaughnessy.net/assets/2022/1/22/table.svg">
<p>Actually, to be more accurate, there will always be many other variables in
this table along with <code>arr</code>. All Crystal programs automatically include the
standard library, so Crystal also saves all of the top level variables from the
standard library in this table.</p>
<p>In a more normal program, there will be many lexical scopes for different
method and class or module definitions, and <code>MainVisitor</code> will save each
variable in the corresponding table.</p>
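<p>One way to picture these tables is as a hash per lexical scope. This is only a rough model
I’m assuming for illustration: the scope names, the <code>VarTable</code> alias and the use of
plain strings are all made up, and the real compiler stores much richer objects:</p>

```crystal
# A hypothetical model: one table of variables per lexical scope.
# Each table maps a variable name to a description of that variable.
alias VarTable = Hash(String, String)

# Create a new, empty table the first time a scope is mentioned.
tables = Hash(String, VarTable).new { |hash, scope| hash[scope] = VarTable.new }

tables["top level"]["arr"] = "Array(Int32)"
tables["def total"]["sum"] = "Int32"
tables["def total"]["item"] = "Int32"

tables.each do |scope, vars|
  puts "#{scope}: #{vars.keys.join(", ")}"
end
```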
<h2>Assigning Types</h2>
<p>Probably the most important function of <code>MainVisitor</code> is to assign a type to each
value in my program. The simplest example of this is when <code>MainVisitor</code> visits a
<code>NumberLiteral</code> node:</p>
<img class="svg" width="300px" src="https://patshaughnessy.net/assets/2022/1/22/visit-number-literal.svg">
<p>Looking at the size of the numeric value, Crystal determines the type should be
<code>Int32</code>. Crystal then saves this type right inside of the <code>NumberLiteral</code> node:</p>
<img class="svg" width="114px" src="https://patshaughnessy.net/assets/2022/1/22/updated-number-literal.svg">
<p>Strictly speaking, this violates the visitor pattern because the visitors
shouldn’t be modifying the data structure they visit. But the type of each
node, the type of each programming construct in my program, is really an
integral part of that node. In this case the <code>MainVisitor</code> is really just
completing each node. It’s not changing the structure of the AST in this case…
although as we’ll see in a minute the <code>MainVisitor</code> does this for other nodes!</p>
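<p>You can ask Crystal which type it chose for a literal by using <code>typeof</code>. This short
example only uses the language’s documented literal rules; it doesn’t show anything about
how <code>MainVisitor</code> works internally:</p>

```crystal
puts typeof(12345)         # Int32
puts typeof(3_000_000_000) # Int64, because the value is too large for Int32
puts typeof(12345_i64)     # Int64, because of the explicit suffix
puts typeof(1.5)           # Float64
```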
<h2>Type Inference</h2>
<p>Sometimes type values can’t be determined from the intrinsic value of an AST
node. Often the type of a node is determined by other nodes in the AST.</p>
<p>Recall my example line of code is:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]</span></pre>
<p>Here Crystal automatically sets the type of the <code>arr</code> variable to the type of the
array literal expression: <code>Array(Int32)</code>. In Computer Science, this is known as
<em>type inference</em>. Because Crystal can automatically determine the type of
<code>arr</code>, I don’t need to declare it explicitly by writing something more
complicated like this:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">=</span><span style="color:#000000;"> uninitialized Array(Int32)
</span><span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]</span></pre>
<p>Type inference allows me to write concise, clean code with fewer type
annotations. Most modern, statically typed languages such as Swift, Rust,
Kotlin, etc., use type inference in the same way as Crystal. Even newer
versions of Java or C++ use type inference.</p>
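<p>A quick way to see type inference at work is to print the inferred type of a variable with
<code>typeof</code>. The variable names <code>mixed</code> and <code>empty</code> are just examples I added:</p>

```crystal
arr = [12345, 67890]
puts typeof(arr) # Array(Int32)

# With elements of different types, Crystal infers a union element type.
mixed = [12345, "sixty-seven"]
puts typeof(mixed) # Array(Int32 | String)

# An empty literal has no elements to infer from, so here the
# element type has to be written out.
empty = [] of Int32
puts typeof(empty) # Array(Int32)
```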
<p>The Crystal compiler implements type inference when the <code>MainVisitor</code> encounters
an <code>Assign</code> AST node, as we saw above.</p>
<img class="svg" width="375px" src="https://patshaughnessy.net/assets/2022/1/22/visit-assign1.svg">
<p>After encountering the <code>Assign</code> node, Crystal recursively processes one of the
two child nodes, the <code>ArrayLiteral</code> value, and its child nodes. When this process
finishes, Crystal knows the type of the <code>ArrayLiteral</code> node is <code>Array(Int32)</code>:</p>
<img class="svg" width="425px" src="https://patshaughnessy.net/assets/2022/1/22/set-type.svg">
<p>I’ll take a closer look at how Crystal processes the <code>ArrayLiteral</code> node next.
But for now, once Crystal has the type of the <code>ArrayLiteral</code> node it copies that
type over to the <code>Var</code> node and sets its type also:</p>
<img class="svg" width="425px" src="https://patshaughnessy.net/assets/2022/1/22/set-type2.svg">
<p>But Crystal does something else interesting here: It sets up a dependency
between the two AST nodes: it “binds” the variable to the value:</p>
<img class="svg" width="325px" src="https://patshaughnessy.net/assets/2022/1/22/bind.svg">
<p>This binding dependency allows Crystal to later update the type of the <code>arr</code>
variable whenever necessary. In this case the value <code>[12345, 67890]</code> will always
have the same type, but I suspect that sometimes the Crystal compiler can
update types during semantic analysis. In this way, if the Crystal compiler ever
changes its mind about the type of some value, it can easily update the types of
any dependent values. I also suspect Crystal uses these type dependency
connections to produce error messages whenever you pass an incorrect type to
some function, for example. These are just guesses, however; if anyone from the
Crystal team knows exactly what these type bindings are used for, let me know.</p>
<p><b>Update:</b> Ary Borenszweig explained that sometimes the Crystal compiler
updates the type of variables based on how they are used. He posted an
interesting example on <a href="https://forum.crystal-lang.org/t/visiting-an-abstract-syntax-tree/4304">The Crystal Programming Language
Forum</a>.</p>
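<p>Here’s one small case where a variable’s type depends on code that appears later. This is
my own illustration based on how Crystal types variables inside a <code>while</code> loop; it
isn’t necessarily the example from the forum thread, and it doesn’t show the bindings
themselves:</p>

```crystal
a = 1
iterations = 0

while iterations < 2
  # Reading top to bottom, a looks like an Int32 here. But the
  # assignment below flows back to the top of the loop, so the
  # type at this point is a union.
  puts typeof(a) # (Int32 | String)
  a = "hello"
  iterations += 1
end

puts typeof(a) # (Int32 | String)
```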
<h2>Expanding an Array Literal</h2>
<p>So far we’ve seen Crystal set the type of the <code>NumberLiteral</code> node to <code>Int32</code>,
and we’ve seen Crystal assign <code>arr</code> a type of <code>Array(Int32)</code>. But how did Crystal
determine the type of the array literal <code>[12345, 67890]</code>?</p>
<p>This is where things get even more complicated. Sometimes during semantic
analysis the Crystal compiler completely rewrites parts of your code, replacing
it with something else. This happens even with my simple example. When visiting
the <code>ArrayLiteral</code> node, the <code>MainVisitor</code> expands this simple line of code into
something more complex:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">__temp_621 </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">::Array(typeof(</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">)).unsafe_build(</span><span style="color:#d08770;">2</span><span style="color:#000000;">)
</span><span style="color:#000000;">__temp_622 </span><span style="color:#4f5b66;">=</span><span style="color:#000000;"> __temp_621.to_unsafe
</span><span style="color:#000000;">__temp_622[</span><span style="color:#d08770;">0</span><span style="color:#000000;">] </span><span style="color:#4f5b66;">= </span><span style="color:#d08770;">12345
</span><span style="color:#000000;">__temp_622[</span><span style="color:#d08770;">1</span><span style="color:#000000;">] </span><span style="color:#4f5b66;">= </span><span style="color:#d08770;">67890
</span><span style="color:#000000;">__temp_621</span></pre>
<p>Reading this, you can see how later my compiled program will create the new
array. First Crystal creates an empty array with a capacity of 2, and an
element type of <code>Int32</code>. <code>typeof(12345, 67890)</code> returns the type (or multiple
types inside a union type) found in the given set of values, in this case just
<code>Int32</code>. Later Crystal sets the two elements in the array just by assigning
them.</p>
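<p>You can try <code>typeof</code> with several values yourself. The second half of this sketch is
my own hand-written approximation of the expansion using the public <code>Array.new</code> and
<code>&lt;&lt;</code> methods; it is not what the compiler generates, which uses
<code>unsafe_build</code> as shown above:</p>

```crystal
# typeof with several expressions gives the union of their types.
puts typeof(12345, 67890)  # Int32
puts typeof(12345, "text") # (Int32 | String)

# A rough, hand-written equivalent of the expansion, using only
# public Array methods. The variable name temp is made up.
temp = Array(typeof(12345, 67890)).new(2)
temp << 12345
temp << 67890

puts temp         # [12345, 67890]
puts typeof(temp) # Array(Int32)
```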
<p>Crystal achieves this by replacing part of my program’s AST with a new branch:</p>
<img class="svg" width="375px" src="https://patshaughnessy.net/assets/2022/1/22/expanded-ast.svg">
<p>For clarity, I’m not drawing the AST nodes for the inner assign operations,
only the first line:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">__temp_621 </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">::Array(typeof(</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">)).unsafe_build(</span><span style="color:#d08770;">2</span><span style="color:#000000;">)</span></pre>
<h2>Putting It All Together</h2>
<p>With this new, updated AST we can see exactly how Crystal determines the type
of my variable <code>arr</code>. Starting at the root of my AST, <code>MainVisitor</code> visits all of
the AST nodes in this order in a series of recursive calls:</p>
<img class="svg" width="114px" src="https://patshaughnessy.net/assets/2022/1/22/call-recurse.svg">
<p>And it determines the types of each of these nodes as it returns from the
recursive calls:</p>
<img class="svg" width="240px" src="https://patshaughnessy.net/assets/2022/1/22/return-recurse.svg">
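<p>The “recurse first, assign the type on the way back” order can be sketched with a toy
example. Everything here is hypothetical: the <code>Node</code> and <code>TypingVisitor</code> classes
are mine, types are plain strings, and the real <code>MainVisitor</code> is far more
involved:</p>

```crystal
# A hypothetical mini-AST; types are plain strings to keep the sketch small.
class Node
  getter kind : String
  getter children : Array(Node)
  property inferred_type : String?

  def initialize(@kind : String, @children = [] of Node)
  end
end

class TypingVisitor
  # Records the order in which nodes receive their types.
  getter typed_order = [] of String

  def visit(node : Node) : String
    # Recurse first: every child is typed before its parent.
    child_types = node.children.map { |child| visit(child) }

    inferred = if node.kind == "NumberLiteral"
                 "Int32"
               else
                 "Array(#{child_types.uniq.join(" | ")})"
               end

    # Assign the type while returning from the recursive calls.
    node.inferred_type = inferred
    @typed_order << "#{node.kind} : #{inferred}"
    inferred
  end
end

literal = Node.new("ArrayLiteral", [Node.new("NumberLiteral"), Node.new("NumberLiteral")])
visitor = TypingVisitor.new
visitor.visit(literal)
visitor.typed_order.each { |line| puts line }
```

<p>The two <code>NumberLiteral</code> lines print first and the <code>ArrayLiteral</code> line prints
last, because the parent’s type depends on its children.</p>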
<p>There are some interesting details here that I don’t completely understand, or
don’t have space to explain:</p>
<ul>
<li>
<p>The <code>TypeOf</code> node calculates a common union type using a type formula. In this
example, it just returns <code>Int32</code> because both elements of my array, <code>12345</code> and
<code>67890</code>, are simple 32-bit integers.</p>
</li>
<li>
<p>I believe the <code>Generic</code> node refers to a Crystal generic class via the <code>Path</code> node
shown above, in this example <code>Array(T)</code>. When <code>MainVisitor</code> processes the <code>Generic</code>
node, it sets <code>T</code> to the type <code>Int32</code>, arriving at the type <code>Array(Int32).class</code>.</p>
</li>
<li>
<p>The <code>Call</code> node looks up the method my code is calling (<code>unsafe_build</code>) and
uses the type from that method’s return value. I didn’t have time to explore
how method lookup works in Crystal, however, so I’m not sure about this.</p>
</li>
</ul>
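<p>The difference between <code>Array(Int32).class</code> and <code>Array(Int32)</code> is easy to check
with <code>typeof</code>. This only shows the resulting types, not how <code>MainVisitor</code> arrives
at them:</p>

```crystal
# Naming the generic class gives the class itself, whose type is a metaclass.
puts typeof(Array(Int32)) # Array(Int32).class

# Calling a method on that class gives the type of the method's return value.
puts typeof(Array(Int32).new(2)) # Array(Int32)
```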
<h2>Scratching the Surface</h2>
<p>Today we looked at a tiny piece of what the Crystal compiler can do. There are
many more types of AST nodes, each of which the <code>MainVisitor</code> class handles
differently. And there are many different visitor classes also, beyond
<code>MainVisitor</code>. When analyzing a more complex program Crystal has to understand
class and module definitions, instance and class variables, type annotations,
different lexical scopes, macros, and much, much more. Crystal will need all of
this information later, during the code generation phase, the next step that
follows semantic analysis.</p>
<p>But I hope this article gave you a sense of what sort of work a compiler has to
do in order to understand your code. As you can see, for a statically typed
language like Crystal the compiler spends much of its time identifying all of
the types in your code, and determining which programming constructs or AST
nodes have which types.</p>
<p>Next time I’ll look at code generation: Now that Crystal has identified the
variables, function calls and types in my code it is ready to generate the
machine language code needed to execute my program. To do that, it will
leverage the LLVM framework.</p>
</content></entry><entry><title>Reading Code Like a Compiler</title><link href="https://patshaughnessy.net/2021/12/22/reading-code-like-a-compiler" rel="alternate"></link><id href="https://patshaughnessy.net/2021/12/22/reading-code-like-a-compiler" rel="alternate"></id><published>2021-12-22T00:00:00Z</published><updated>2021-12-22T00:00:00Z</updated><category>Crystal</category><author><name>Pat Shaughnessy</name></author><summary type="html"><div style="float: left; padding: 8px 30px 0px 0px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2021/12/22/depth-of-field.jpg"><br/>
<i>Imagine trying to read an entire book while <br/>
focusing on only one or two words at a time
</i>
</div>
<p>We use compilers every day to parse our code, find our programming mistakes and
then help us fix them. But h</summary><content type="html"><div style="float: left; padding: 8px 30px 0px 0px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2021/12/22/depth-of-field.jpg"><br/>
<i>Imagine trying to read an entire book while <br/>
focusing on only one or two words at a time
</i>
</div>
<p>We use compilers every day to parse our code, find our programming mistakes and
then help us fix them. But how do compilers read and understand our code? What
does our code look like to them?</p>
<p>We tend to read code like we would read a human language like English. We
don’t see letters; we see words and phrases. And in a very natural way we use
what we just read, the preceding sentence or paragraph, to give us the context
we need to understand the following text. And sometimes we just skim over text
quickly to glean a bit of the meaning without even reading every word.</p>
<div style="clear: both"></div>
<p>Compilers aren’t as smart as we are. They can’t read and understand entire
phrases or sentences all at once. They read text one letter, one word at a
time, meticulously building up a record of what they have read so far.</p>
<p>I was curious to learn more about how compilers parse text, but where should I
look? Which compiler should I study? Once again, like in my last few posts,
Crystal was the answer.</p>
<h2>Crystal: A Compiler Accessible to Everyone</h2>
<p><a href="https://crystal-lang.org">Crystal</a> is a unique combination of simple, human
syntax inspired by Ruby, with the speed and robustness enabled by static types
and the use of <a href="https://llvm.org">LLVM</a>. But for me the most exciting thing
about Crystal is how the Crystal team implemented both its standard library and
compiler using the target language: Crystal. This makes Crystal’s internal
implementation accessible to anyone familiar with Ruby. For once, you don’t
have to be a hard core C or C++ developer to learn how a compiler works.
Reading code not much more complex than a Ruby on Rails web site, I can take a
peek under the hood of a real world compiler to see how it works internally.</p>
<p>Not only did the Crystal team implement their compiler using Crystal, they also
wrote its parser by hand. Parsing is such a tedious task that developers often use a
parser generator, such as <a href="https://www.gnu.org/software/bison/">GNU Bison</a>, to
automatically generate the parse code given a set of rules. This is how Ruby
works, for example. But the Crystal team wrote their parser directly in
Crystal, which you can read in
<a href="https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/syntax/parser.cr">parser.cr</a>.</p>
<p>Along with a readable compiler, I need a readable program to compile. I decided to
reuse the same array snippet from my last post:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]
</span><span style="color:#000000;">puts arr[</span><span style="color:#d08770;">1</span><span style="color:#000000;">]</span></pre>
<p>This tiny Crystal program creates an array of two numbers and then prints out
the second number. Simple enough: You and I can read and parse these two lines
of code in one glance and in a fraction of a second understand what it does.
Even if you’re not a Crystal or Ruby developer this syntax is so simple you can
still understand it.</p>
<p>But the Crystal compiler can’t understand this code as easily as we can.
Parsing even this simple program is a complex task for a compiler.</p>
<h2>How the Crystal Compiler Sees My Code</h2>
<p>Before parsing or running the code above, Crystal converts it into a series of
tokens. To the Crystal compiler, my program looks like this:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/tokens.png"><br/></p>
<p>The first <code>IDENT</code> token corresponds to the <code>arr</code> variable at the beginning of the
first line. You can also see two <code>NUMBER</code> tokens: the <a href="https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/syntax/lexer.cr">Crystal tokenizer
code</a>
converted each series of numerical digits into single tokens, one for 12345 and
the other for 67890. Along with these tokens you can also see other tokens for
punctuation used in Crystal syntax, like the equals sign and left and right
square brackets. There is also a new line token and one for the end of the
entire file.</p>
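<p>To give a feel for what a tokenizer does, here’s a tiny sketch. It is a hypothetical,
much-simplified version that I wrote for illustration; the <code>Token</code> record, the
<code>tokenize</code> method and the exact token names are assumptions, and Crystal’s real
lexer handles far more cases:</p>

```crystal
# A hypothetical token: a kind such as "IDENT" plus the matching text.
record Token, kind : String, text : String

def tokenize(source : String) : Array(Token)
  chars = source.chars
  tokens = [] of Token
  i = 0

  while i < chars.size
    char = chars[i]
    if char.ascii_letter?
      # Group a run of letters, digits and underscores into one IDENT.
      start = i
      while i < chars.size && (chars[i].ascii_alphanumeric? || chars[i] == '_')
        i += 1
      end
      tokens << Token.new("IDENT", chars[start...i].join)
    elsif char.ascii_number?
      # Group a run of digits into one NUMBER.
      start = i
      while i < chars.size && chars[i].ascii_number?
        i += 1
      end
      tokens << Token.new("NUMBER", chars[start...i].join)
    elsif char == '\n'
      tokens << Token.new("NEWLINE", "\\n")
      i += 1
    elsif char == ' '
      tokens << Token.new("SPACE", " ")
      i += 1
    else
      # Punctuation becomes a token named after the character itself.
      tokens << Token.new(char.to_s, char.to_s)
      i += 1
    end
  end

  tokens << Token.new("EOF", "")
  tokens
end

puts tokenize("arr = [12345, 67890]").map(&.kind).join(" ")
# IDENT SPACE = SPACE [ NUMBER , SPACE NUMBER ] EOF
```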
<h2>Reading a Book One Word at a Time</h2>
<p>To understand my code, Crystal processes these tokens one at a time, stepping
tediously through the entire program. What a slow, painful process!</p>
<p>How would we read if we could only see one word at a time? I imagine covering
the book I’m trying to read with a piece of paper or plastic that had a small
hole in it… and that through the hole I could only see one word at a time. How
would I read one entire page? Well, I’d have to move the paper around, showing
one word and then another and another. And how would I know where to move the
paper next? Would I simply move the paper forward one word at a time? What if
I forgot some word I had seen earlier? I’d have to backtrack - but how far back
to go? What if the meaning of the word I was looking at depended on the words
that followed it? This sounds like a nightmare.</p>
<p>To read like this, if it was even possible at all, I’d have to have a well
thought out strategy. I’d have to know exactly how to move that plastic screen
around. When you can only read one word at a time, deciding which word to read
next becomes incredibly important. I would need an algorithm to follow.</p>
<p>This is what a parser algorithm is: Some set of rules the parse code can use to
interpret each word, and, equally important, to decide which word to read next.
Crystal’s parse code is over 6000 lines long, so I won’t attempt to completely
explain it here. But there’s an underlying, high level algorithm the parse code
uses:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/pattern-recurse-record.png"><br/></p>
<p>First, the parser compares the current token, and possibly the following or
previous tokens as well, to a series of expected patterns. These patterns
define the syntax the parser is reading. Second, the parser recurses. It calls
itself to parse the next token, or possibly multiple next tokens depending on
which pattern the parser just matched. Finally, the parser records what it saw:
which pattern matched the current token and the results of the recursive calls
to itself, for future reference.</p>
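<p>Here’s a toy parser that follows those three steps for nested array literals of numbers.
It’s a sketch under simplifying assumptions: the tokens are plain strings with no spaces,
the input is assumed to be well formed, and <code>TinyParser</code>, <code>NumberNode</code> and
<code>ArrayNode</code> are hypothetical names rather than anything from <code>parser.cr</code>:</p>

```crystal
# Hypothetical AST node types for this sketch.
abstract class Node
end

class NumberNode < Node
  getter value : String

  def initialize(@value : String)
  end
end

class ArrayNode < Node
  getter elements : Array(Node)

  def initialize(@elements : Array(Node))
  end
end

class TinyParser
  def initialize(@tokens : Array(String))
    @pos = 0
  end

  def parse_expression : Node
    token = @tokens[@pos]
    if token == "["
      # 1. Match: a left bracket starts an array literal.
      @pos += 1
      elements = [] of Node
      until @tokens[@pos] == "]"
        # 2. Recurse: each element is itself an expression.
        elements << parse_expression
        @pos += 1 if @tokens[@pos] == ","
      end
      @pos += 1 # step past the closing "]"
      # 3. Record: save what was parsed as an AST node.
      ArrayNode.new(elements)
    else
      # Anything else is treated as a number in this sketch.
      @pos += 1
      NumberNode.new(token)
    end
  end
end

ast = TinyParser.new(["[", "12345", ",", "[", "67890", "]", "]"]).parse_expression
if ast.is_a?(ArrayNode)
  puts ast.elements.size                # 2
  puts ast.elements[1].is_a?(ArrayNode) # true
end
```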
<h2>Matching a Pattern</h2>
<p>The best way to understand how this works is to see it in action. Let’s follow
along with the Crystal compiler as it parses the code I showed above:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]
</span><span style="color:#000000;">puts arr[</span><span style="color:#d08770;">1</span><span style="color:#000000;">]</span></pre>
<p>Recall Crystal already converted this code into a token stream:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/token-line.png"><br/></p>
<p>(To be more accurate, Crystal actually converts my code into tokens as it goes.
The parse code calls the tokenizer code each time it needs a new token. But
this timing isn’t really important.)</p>
<p>As you might expect, Crystal starts with the first token, <code>IDENT</code>.</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/process-token1.png"><br/></p>
<p>What does this mean? How does Crystal interpret <code>arr</code>? <code>IDENT</code> is short for
identifier, but what role does this identifier play? What meaning does <code>arr</code> have
in my code?</p>
<p>To decide on the correct meaning, the Crystal parser compares the <code>IDENT</code> token
with a series of patterns. For example Crystal looks for patterns like:</p>
<ul>
<li>
<p>a ternary expression <code>a ? b : c</code></p>
</li>
<li>
<p>a range <code>a..b</code></p>
</li>
<li>
<p>an expression using a binary operator, such as: <code>a + b</code>, etc.</p>
</li>
<li>
<p>and many more…</p>
</li>
</ul>
<p>It turns out none of these patterns apply in this case, and Crystal ends up
selecting a default pattern which handles the most common code pattern: a
function call. Crystal decides that when I wrote <code>arr</code> I intended to call a
function called <code>arr</code>.</p>
<p>I often tell people I work with at my day job that I have a really bad memory.
And it’s true. I constantly have to google the syntax or return values of
functions. I often forget what some code means even just a month after I wrote
it. And the Crystal compiler is no better: As soon as it processes that <code>IDENT</code>
token above, it has to write down what it decided that token meant or else it
would forget.</p>
<p>To record the function call, Crystal creates an object:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/ast1.png"><br/></p>
<p>As we’ll see in a moment, Crystal builds up a tree of these objects, called an
<a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
Tree</a> (AST). The AST will
later serve as a record of the syntactic structure of my code.</p>
<h2>Recursively Calling Itself</h2>
<p>Parsing is inherently a recursive process. Unlike English text, Crystal
expressions can be nested one inside another to any depth. Although I suppose
English grammar is somewhat recursive and can be nested to some degree. I
wonder if the grammars for some other human languages are more recursive than
English? Interesting question.</p>
<p>For parsing a programming language like Crystal, the simplest thing for the
parser code to do is recursively call itself. And it does this based on the
pattern it just matched. For example, if Crystal had parsed a plus sign, it
would need to recursively call itself to parse the values that appeared before
and after the plus.</p>
<p>In this example, Crystal has to decide what arguments to pass to this call to
the <code>arr</code> function. Did I write <code>arr(1, 2, 3)</code> or just <code>arr</code>? Or <code>arr()</code>? What were
the values 1, 2 and 3? Each of these could be a complex expression in its own
right, maybe appearing inside of parentheses, a compound value like an array or
maybe yet another function call.</p>
<p>To find the arguments of the function call, inside the recursive call to the
parse code Crystal proceeds forward to process the next two tokens:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/process-token2.png"><br/></p>
<p>Crystal skips over the space, and then encounters the equals sign. Suddenly
Crystal realizes it was wrong! The <code>arr</code> identifier wasn’t a reference to a
function at all, it was a variable declaration. Yes, sometimes compilers change
their minds while reading, just like we do!</p>
<h2>Recording an AST Node</h2>
<p>To record this new, revised syntax, Crystal changes the <code>Call</code> AST node it
created earlier to an <code>Assign</code> AST node, and creates a new <code>Var</code> AST node to
record the variable being assigned to:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/ast2.png"><br/></p>
<p>Now the AST is starting to resemble a tree. Because of the recursive nature of
the parse algorithm, this tree structure is an ideal way to record what the
compiler has parsed so far. Trees are recursive too: Each branch is a tree in
its own right.</p>
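<p>The change from a <code>Call</code> node to an <code>Assign</code> node can be sketched in a
few lines of Ruby. This is only an illustration of the idea, not the Crystal
compiler's parser: the <code>Call</code>, <code>Var</code> and <code>Assign</code> structs, the token
format and the <code>parse_identifier</code> and <code>token_type</code> helpers are all
hypothetical names made up for this sketch.</p>

```ruby
# Hypothetical node types, named after the AST nodes in the diagrams above.
Call   = Struct.new(:name, :args)
Var    = Struct.new(:name)
Assign = Struct.new(:target, :value)

# Returns the type of the token at pos, or nil when pos is past the end.
def token_type(tokens, pos)
  tokens.fetch(pos, [nil, nil]).first
end

# tokens is an array of [type, text] pairs and pos points at an :ident token.
# Returns the recorded node and the position of the next unread token.
def parse_identifier(tokens, pos)
  name = tokens[pos][1]
  node = Call.new(name, [])                 # first guess: a call with no arguments
  pos += 1
  pos += 1 while token_type(tokens, pos) == :space
  if token_type(tokens, pos) == :equals
    # The guess was wrong, so record an assignment instead. A fuller parser
    # would recurse here to parse the value; this sketch leaves it as nil.
    node = Assign.new(Var.new(name), nil)
    pos += 1
  end
  [node, pos]
end

tokens = [[:ident, "arr"], [:space, " "], [:equals, "="]]
node, _next_pos = parse_identifier(tokens, 0)
p node.class        # prints Assign
p node.target.name  # prints "arr"
```

<p>Without the equals sign, the same helper returns the original <code>Call</code> guess
unchanged.</p>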
<h2>Rinse and Repeat</h2>
<p>But what value should Crystal assign to that variable? What should appear in
the AST as the value attribute of the <code>Assign</code> node?</p>
<p>To find out, the Crystal compiler recursively calls the same parsing algorithm
again, but starting with the <code>[</code> token:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/process-token3.png"><br/></p>
<p>Following the pattern match, record and recurse process, the Crystal compiler
once again matches the new token, <code>[</code>, with a series of expected patterns. This
time, Crystal decides that the left bracket is the start of an array literal
expression and records a new AST node:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/array-literal1.png"><br/></p>
<p>But before inserting it into the syntax tree, Crystal recursively calls itself
to parse each of the values that appear in the array. The array literal pattern
expects a series of values separated by commas, so Crystal proceeds
to process the following tokens, looking for those values and commas:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/process-token4.png"><br/></p>
<p>After encountering the comma, Crystal recursively calls the same parse code
again on the previous token or tokens that appeared before the comma, because
the array value before the comma could be another expression of arbitrary depth
and complexity. In this example, Crystal finds a simple numeric array element,
and creates a new AST node to represent the numeric value:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/number-literal1.png"><br/></p>
<p>After reading the comma, Crystal calls its parser recursively again, and finds
the second number:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/number-literal2.png"><br/></p>
<p>Remember Crystal has a bad memory. With all these new AST nodes, Crystal will
quickly forget what they mean. Fortunately, Crystal reads in the right square
bracket and realizes I ended the array literal in my code:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/process-token5.png"><br/></p>
<p>Now those recursive calls to the parse code return, and Crystal assembles these
new AST nodes:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/array-literal2.png"><br/></p>
<p>…and then places them inside the larger, surrounding AST:</p>
<p><img src="https://patshaughnessy.net/assets/2021/12/22/ast3.png"><br/></p>
<p>After this, these recursive calls return and the Crystal compiler moves on to
parse the second line of my program.</p>
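<p>Here is a rough Ruby sketch of that match, record and recurse loop for array
literals. It assumes a tiny made-up grammar with only integers, commas and
square brackets, and the <code>NumberLiteral</code> and <code>ArrayLiteral</code> structs and the
<code>tokenize</code>, <code>parse_value</code> and <code>parse_array</code> methods are hypothetical
names, not Crystal's real ones.</p>

```ruby
# Hypothetical node types for illustration only.
NumberLiteral = Struct.new(:value)
ArrayLiteral  = Struct.new(:elements)

# Splits source text into number and punctuation tokens, dropping whitespace.
def tokenize(source)
  source.scan(/\d+|[\[\],]/)
end

# Parses one value starting at tokens[pos]; returns [node, next_pos].
def parse_value(tokens, pos)
  token = tokens[pos]
  if token == "["
    parse_array(tokens, pos + 1)
  elsif /\A\d+\z/.match?(token.to_s)
    [NumberLiteral.new(token.to_i), pos + 1]
  else
    raise ArgumentError, "unexpected token #{token.inspect}"
  end
end

# Parses the elements that follow a "[" up to the matching "]".
def parse_array(tokens, pos)
  elements = []
  until tokens[pos] == "]"
    raise ArgumentError, "missing ]" if tokens[pos].nil?
    element, pos = parse_value(tokens, pos)   # recursive call for each element
    elements.push(element)
    pos += 1 if tokens[pos] == ","            # step over the separator
  end
  [ArrayLiteral.new(elements), pos + 1]       # pos + 1 steps over the "]"
end

ast, _next_pos = parse_value(tokenize("[12345, 67890]"), 0)
p ast.elements.map { |element| element.value }   # prints [12345, 67890]
```

<p>Because <code>parse_array</code> calls <code>parse_value</code> for every element, a nested
literal such as <code>[[1, 2], 3]</code> needs no extra code.</p>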
<h2>A Complete Abstract Syntax Tree</h2>
<p>After following the Crystal parser for a while, I added some debug logging code
to the compiler so I could see the result. Here’s my example code again:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]
</span><span style="color:#000000;">puts arr[</span><span style="color:#d08770;">1</span><span style="color:#000000;">]</span></pre>
<p>And here’s the complete AST the Crystal compiler generated after parsing my
code. My debug logging indented each line to indicate the AST structure:</p>
<pre type="console">&lt;Crystal::Expressions exp_count=3 >
  &lt;Crystal::Require string=prelude >
  &lt;Crystal::Assign target=Crystal::Var value=Crystal::ArrayLiteral >
    &lt;Crystal::Var name=arr >
    &lt;Crystal::ArrayLiteral element_count=2 of=Nil name=Nil >
      &lt;Crystal::NumberLiteral number=12345 kind=i32 >
      &lt;Crystal::NumberLiteral number=67890 kind=i32 >
  &lt;Crystal::Call obj= name=puts arg_count=1 >
    &lt;Crystal::Call obj=arr name=[] arg_count=1 >
      &lt;Crystal::Var name=arr >
      &lt;Crystal::NumberLiteral number=1 kind=i32 ></pre>
<p>Each of these nodes is an instance of a subclass of <code>Crystal::ASTNode</code>.
Crystal defines all of these in the
<a href="https://github.com/crystal-lang/crystal/blob/master/src/compiler/crystal/syntax/ast.cr">ast.cr</a>
file. Some interesting details to note:</p>
<ul>
<li>
<p>The top level node is called <code>Expressions</code>, and more or less holds one
expression per line of code.</p>
</li>
<li>
<p>The second node, the first child node of <code>Expressions</code>, is called <code>Require</code>.
The surprise here is that I didn’t even put a <code>require</code> keyword in my
program! Crystal silently inserts <code>require prelude</code> at the beginning of
all Crystal programs. The “prelude” is the Crystal standard library, the code
that defines <code>Array</code>, <code>String</code> and many other core classes. Reading the AST allows
us to see how the Crystal compiler does this automatically.</p>
</li>
<li>
<p>The third node and its children are the nodes we saw Crystal create above for
my first line of code, the array literal and the variable it is assigned to.</p>
</li>
<li>
<p>Finally, the last branch of the tree shows the call to <code>puts</code>. This time
Crystal’s default guess about identifiers being function calls was correct.
Another interesting detail here is that the inner call to the <code>[]</code> function
was not generated by an identifier, but by the <code>[</code> token. This was one of the
patterns the Crystal parser checked for after one of the recursive parse
calls.</p>
</li>
</ul>
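<p>The indented logging itself is a small recursive walk over the tree. The sketch
below is a Ruby approximation, not the debug code I added to the compiler:
<code>Node</code> is a hypothetical stand-in for an AST node, the labels are
abbreviated, and <code>dump</code> is a made-up name.</p>

```ruby
# Hypothetical stand-in for an AST node: a label plus a list of child nodes.
Node = Struct.new(:label, :children)

# Returns one line per node, indented two spaces per level of depth.
def dump(node, depth = 0)
  lines = [("  " * depth) + node.label]
  node.children.each { |child| lines.concat(dump(child, depth + 1)) }
  lines
end

tree = Node.new("Expressions", [
  Node.new("Require prelude", []),
  Node.new("Assign", [
    Node.new("Var arr", []),
    Node.new("ArrayLiteral", [
      Node.new("NumberLiteral 12345", []),
      Node.new("NumberLiteral 67890", [])
    ])
  ]),
  Node.new("Call puts", [
    Node.new("Call []", [
      Node.new("Var arr", []),
      Node.new("NumberLiteral 1", [])
    ])
  ])
])

puts dump(tree)
```

<p>Each recursive call adds one level of indentation, so the shape of the output
mirrors the shape of the tree.</p>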
<h2>Next Time</h2>
<p>What’s the point of all of this? What does the Crystal compiler do next with
the AST? This tree structure is a fantastic summary of how Crystal parsed my
code and, as we’ll see later, also gives Crystal a convenient way
to process my code and transform it in different ways.</p>
<p>When I have time, I plan to write a few more posts about more of the inner
workings of the Crystal compiler and the LLVM framework, which Crystal later
uses to generate my x86 executable program.</p>
</content></entry><entry><title>Find Your Language’s Primitives</title><link href="https://patshaughnessy.net/2021/11/29/find-your-languages-primitives" rel="alternate"></link><id href="https://patshaughnessy.net/2021/11/29/find-your-languages-primitives" rel="alternate"></id><published>2021-11-29T00:00:00Z</published><updated>2021-11-29T00:00:00Z</updated><category>Crystal</category><author><name>Pat Shaughnessy</name></author><summary type="html"><div style="float: right; padding: 8px 0px 30px 30px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2021/11/29/dig1.jpg"><br/>
<i>If you dig into your programming language's syntax, you might <br/>discover that it is capable of much more than you thought it was.
</i>
</div>
<p>Wikipedia defines “Language Primitive” <a href="https://en.wikipedia.org/wi</summary><content type="html"><div style="float: right; padding: 8px 0px 30px 30px; text-align: center; line-height:18px">
<img src="https://patshaughnessy.net/assets/2021/11/29/dig1.jpg"><br/>
<i>If you dig into your programming language's syntax, you might <br/>discover that it is capable of much more than you thought it was.
</i>
</div>
<p>Wikipedia defines “Language Primitive” <a href="https://en.wikipedia.org/wiki/Language_primitive">this
way</a>:</p>
<blockquote>
In computing, language primitives are the simplest elements available in a
programming language. A primitive is the smallest 'unit of processing'
available to a programmer of a given machine, or can be an atomic element of an
expression in a language.
</blockquote>
<p>By looking at a language’s primitives, we can learn what kind of code will be
easy to write or impossible to express, and what types of problems the language
was intended to solve. Whether you’ve been using a language for years or are just
now learning a new one for fun, take the time to find and learn about your
language’s primitives. You might discover something you never knew, and will
come away with a deeper understanding of how your programs work.</p>
<p>As an example today, I’m going to look at how arrays work in three languages:
Ruby, Crystal and x86 Assembly Language.</p>
<h2>Retrieving an Array Element In Ruby</h2>
<p>In Ruby I can create an array and later access an element like this:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]
</span><span style="color:#000000;">puts arr[</span><span style="color:#d08770;">1</span><span style="color:#000000;">]</span></pre>
<p>This code would be the same or almost the same in many other programming
languages. It just means: “find the second element of the array and print it to
stdout.”</p>
<p>But how does this actually work? In Ruby, the <span class="code">Array</span>
class and all of its methods are language primitives. This means array methods
like <span class="code">[]</span> or <span class="code">[]=</span> cannot be
broken down into smaller pieces of Ruby code. As Wikipedia says, these methods
are the smallest unit of processing available to Ruby programmers working with
arrays.</p>
<p><img src="https://patshaughnessy.net/assets/2021/11/29/primitive1.png"><br/></p>
<p>Ruby hides the details of how arrays actually work from us. To learn how Ruby
actually saves and retrieves values from an array, we would need to switch
languages and drop down a level of abstraction, and read the C implementation
in the Ruby source code:
<a href="https://github.com/ruby/ruby/blob/master/array.c">array.c</a>. There’s nothing
wrong with this, of course. Ruby developers use arrays every day without any
trouble. But switching from Ruby to C makes understanding internal details much
more difficult.</p>
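<p>One way to see this boundary from inside Ruby is <span class="code">Method#source_location</span>,
which returns <span class="code">nil</span> for methods that were not written in Ruby. The
sketch below assumes CRuby, where, as far as I know, <span class="code">Array#[]</span> lives in
array.c; other Ruby implementations may report something different. The
<span class="code">second_element</span> method is a hypothetical helper defined only for
comparison.</p>

```ruby
# A method written in Ruby reports the file and line where it was defined.
def second_element(arr)
  arr[1]
end

arr = [12345, 67890]
ruby_location   = method(:second_element).source_location
native_location = arr.method(:[]).source_location

p ruby_location     # a [file, line] pair pointing at the def above
p native_location   # nil on CRuby: there is no Ruby source to point to
```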
<h2>Retrieving an Array Element In Crystal</h2>
<p>This Fall I decided to learn more about <a href="https://crystal-lang.org">Crystal</a>, a
statically typed language with syntax that resembles Ruby. I expected to find a
similar <span class="code">Array#[]</span> primitive. But surprisingly, I was
wrong!</p>
<p>The same code from above also works in Crystal:</p>
<pre style="background-color:#ffffff;">
<span style="color:#000000;">arr </span><span style="color:#4f5b66;">= </span><span style="color:#000000;">[</span><span style="color:#d08770;">12345</span><span style="color:#000000;">, </span><span style="color:#d08770;">67890</span><span style="color:#000000;">]
</span><span style="color:#000000;">puts arr[</span><span style="color:#d08770;">1</span><span style="color:#000000;">]</span></pre>
<p>In Crystal, arrays are not language primitives because the Crystal standard
library implements arrays using Crystal itself. The <span
class="code">Array#[]</span> method is not the smallest unit of processing
available to Crystal programmers. Let’s dig into the details and divide up the
<span class="code">[]</span> method into smaller and smaller pieces to see how
the Crystal team implemented it.</p>
<p>Reading
<a href="https://github.com/crystal-lang/crystal/blob/master/src/indexable.cr#L56">src/indexable.cr</a>
in the Crystal standard library, here’s the implementation of <span class="code">Indexable#[]</span>
which the array class uses when I call <span class="code">arr[1]</span> above:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a7adba;"># Returns the element at the given *index*.
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># Negative indices can be used to start counting from the end of the array.
</span><span style="color:#a7adba;"># Raises `IndexError` if trying to access an element outside the array&#39;s range.
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># ```
</span><span style="color:#a7adba;"># ary = [&#39;a&#39;, &#39;b&#39;, &#39;c&#39;]
</span><span style="color:#a7adba;"># ary[0] # =&gt; &#39;a&#39;
</span><span style="color:#a7adba;"># ary[2] # =&gt; &#39;c&#39;
</span><span style="color:#a7adba;"># ary[-1] # =&gt; &#39;c&#39;
</span><span style="color:#a7adba;"># ary[-2] # =&gt; &#39;b&#39;
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># ary[3] # raises IndexError
</span><span style="color:#a7adba;"># ary[-4] # raises IndexError
</span><span style="color:#a7adba;"># ```
</span><span style="color:#000000;">@[AlwaysInline]
</span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">[]</span><span style="color:#000000;">(index : Int)
</span><span style="color:#000000;">  fetch(index) { </span><span style="color:#795da3;">raise </span><span style="color:#008080;">IndexError</span><span style="color:#000000;">.</span><span style="color:#795da3;">new </span><span style="color:#000000;">}
</span><span style="color:#a71d5d;">end</span></pre>
<p>The Crystal team implemented <span class="code">[]</span> using another method
called <span class="code">fetch</span>:</p>
<pre style="background-color:#ffffff;">
<span style="color:#a7adba;"># Returns the element at the given *index*, if in bounds,
</span><span style="color:#a7adba;"># otherwise executes the given block with the index and returns its value.
</span><span style="color:#a7adba;">#
</span><span style="color:#a7adba;"># ```
</span><span style="color:#a7adba;"># a = [:foo, :bar]
</span><span style="color:#a7adba;"># a.fetch(0) { :default_value } # =&gt; :foo
</span><span style="color:#a7adba;"># a.fetch(2) { :default_value } # =&gt; :default_value
</span><span style="color:#a7adba;"># a.fetch(2) { |index| index * 3 } # =&gt; 6
</span><span style="color:#a7adba;"># ```
</span><span style="color:#a71d5d;">def </span><span style="color:#795da3;">fetch</span><span style="color:#000000;">(index : Int)
</span><span style="color:#000000;">  index </span><span style="color:#4f5b66;">=</span><span style="color:#000000;"> check_index_out_of_bounds(index) </span><span style="color:#a71d5d;">do
</span><span style="color:#000000;">    </span><span style="color:#a71d5d;">return yield </span><span style="color:#000000;">index
</span><span style="color:#000000;">  </span><span style="color:#a71d5d;">end
</span><span style="color:#000000;">  unsafe_fetch(index)
</span><span style="color:#a71d5d;">end</span></pre>
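<p>The same layering can be imitated in Ruby to make the structure easier to see.
This is a loose sketch and not a port of the Crystal code: <span class="code">TinyList</span>
is a hypothetical class, and its bounds check is written inline rather than in a
separate <span class="code">check_index_out_of_bounds</span> method.</p>

```ruby
# Hypothetical Ruby class, loosely modeled on the Crystal methods quoted above.
class TinyList
  def initialize(*values)
    @values = values
  end

  # Public indexer: delegates to fetch and raises when the index is out of range.
  def [](index)
    fetch(index) { raise IndexError, "index #{index} out of bounds" }
  end

  # Returns the element if index is in bounds, otherwise the block's value.
  def fetch(index)
    # Negative indices count back from the end of the list.
    position = index.negative? ? index + @values.size : index
    return yield(index) unless (0...@values.size).cover?(position)
    unsafe_fetch(position)
  end

  # No bounds check at all: callers must pass a valid, non-negative position.
  def unsafe_fetch(position)
    @values[position]
  end
end

list = TinyList.new(:foo, :bar)
p list[1]                              # prints :bar
p list[-1]                             # prints :bar
p list.fetch(2) { |index| index * 3 }  # prints 6
```

<p>Note where the sketch stops: <span class="code">unsafe_fetch</span> has to fall back on
Ruby’s built-in <span class="code">Array#[]</span>, because Ruby offers nothing smaller to
build on.</p>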
<p>Neither the <span class="code">[]</span> operator nor the <span
class="code">fetch</span> method is a language primitive. To find a language
primitive, I need to keep dividing the code up into smaller and smaller pieces,
until it can’t be divided any further. The same process a chemist would use to
break up some material into smaller and smaller molecules until they are left