\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[margin=1.5cm]{geometry}
\usepackage{amsmath}
\usepackage{listings}
\usepackage{tikz}
\usepackage{siunitx}
\usepackage{pgfplots}
\usepackage{hyperref}
\usepgfplotslibrary{units} % Allows to enter the units nicely
\title{Codegeneratoren}
\author{Sebastian Geiger\\ \small{Twitter: @lanoxx, Email: [email protected]}}
\lstset{ %
%backgroundcolor=\color{background}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}
basicstyle=\small\ttfamily, % the size of the fonts that are used for the code
%commentstyle=\color{mygreen}, % comment style
%frame=lines, % adds a frame around the code
columns=fixed,
%keywordstyle=\color{blue}, % keyword style
language=Java, % the language of the code
numbers=left, % where to put the line-numbers; possible values are (none, left, right)
numbersep=5pt, % how far the line-numbers are from the code
%numberstyle=\scriptsize\color{mygray}, % the style that is used for the line-numbers
showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
showstringspaces=false, % underline spaces within strings only
showtabs=false, % show tabs within strings adding particular underscores
%stringstyle=\color{mymauve}, % string literal style
tabsize=2 % sets default tabsize to 2 spaces
}
\sisetup{
round-mode = places,
round-precision = 2,
}
\begin{document}
\maketitle
\begin{abstract}
This is a very short summary of the topics that are taught in ``Codegeneratoren''. Code generators are the backend part of a compiler
that deals with the conversion of the machine independent intermediate representation into machine dependent instructions and with the
various optimizations on the resulting code. The first two sections deal with prerequisites that are required for the later parts.
Section 1 discusses \textit{computer architectures} in order to give a better understanding of how processors work and why different
optimizations are possible. Section 2 discusses some of the higher level \textit{optimizations} that are done before the code generator
does its work. Section 3 discusses \textit{instruction selection}, which concerns the actual conversion of the intermediate representation
into machine dependent instructions. The later sections all deal with different kinds of optimizations that are possible on the resulting
code. I have written this document to help myself study for this subject. If you find this document useful or if you find errors or
mistakes then please send me an email.
\end{abstract}
\tableofcontents
\section{Computer architectures}
Code generators depend on the computer architecture they are designed for. Therefore a comprehensive understanding of the different
concepts of computer architectures is required. This chapter can be roughly split into two main aspects: \textbf{microarchitectures} and
\textbf{Instruction Set Architectures}, sometimes abbreviated ISA. The concept of a \textit{microarchitecture} refers to the hardware part
of a processor, that is its different parts such as execution units, pipelines and caches. The instruction set architecture defines
the interface between the processor and its users (usually assembler programmers or compiler writers). It is easy to see that a
code generator depends on the instruction set architecture it was designed for. For example a different code generator is needed for
CISC and RISC computers. But a code generator developer also needs a good understanding of the underlying microarchitecture in order to
better optimize the generated code. An example for such optimizations is the use of instruction level parallelism or code scheduling to
achieve higher pipeline utilization. Below the most important concepts of microarchitectures and ISAs are explained.
\subsection{Microarchitectures}
Each processor is composed of many different hardware parts, that are all responsible for different things. The following list gives an
overview of the most important parts and their functions:
\begin{itemize}
\item \textbf{Caches:} Store some of the program code and data from memory to allow faster access. Separate instruction and data caches
can allow simultaneous instruction and data access every cycle.
\item \textbf{In/out of order execution:} Realized by reorder buffers. Instructions are dispatched to the reorder buffer in-order but
the execution units can execute the instruction out-of-order.
\item \textbf{Register renaming:} Register renaming minimizes architectural resource dependencies, namely WAW
\footnote{\label{footnote-waw}WAW and WAR are abbreviations for data dependencies and are explained in Section \ref{sec:graphs}.}
and WAR \footnote{See footnote \ref{footnote-waw}} dependencies. If instructions are executed out-of-order, then it may be the case
that a result needs to be written into a register while an earlier instruction still needs the old value. By renaming the register
and temporarily storing the new result in a different register the data dependency between those instructions can be broken up. On
the microarchitectural level \textbf{rename buffers} are used to implement register renaming.
\item \textbf{Precise interrupts:} An interrupt is precise if the state of the processor includes the data of all instructions before
the instruction that caused the interrupt and no data from the instruction causing the interrupt or later instructions. Processors
that do not have precise interrupts are quite difficult to program.
\item \textbf{Reorder buffer:} A reorder buffer is used to save the original order and the states of instructions as they are dispatched
to execution units. Reorder buffers are used to implement precise interrupts, because they allow the processor to enforce in-order
completion of instructions even though they may be executed out-of-order by the execution units.
\item \textbf{Completion stage:} Important for precise interrupts. The completion stage is kept separate from the write-back stage to
support load-with-update instructions.
\item \textbf{Branch prediction:} A technique to avoid pipeline stalling. The processor predicts if a branch is taken and what its
target address is before this information becomes available. If the prediction is correct then the pipeline continues processing
at full throughput. If the prediction is wrong, then the pipeline must discard all pending operations and restart at the
correct location.
\item \textbf{Branch target address cache:} Used to predict the target address of a branch or jump instruction. The BTAC is accessed
with the fetch address and returns the target address.
\item \textbf{Branch history table:} Stores information about whether a branch was taken or not. Depending on the complexity of the
branch history table it is also possible to store patterns, for example that a branch was only taken every second time.
\item \textbf{Reservation station:} Used to keep an instruction until its operands become available (The PowerPC 604 has two
reservation stations for each execution unit). This way instructions can be dispatched even if their operands are not yet
available.
\item \textbf{Superscalar architecture:} A superscalar processor executes more than one instruction during a clock cycle by
simultaneously dispatching multiple instructions to redundant functional units on the processor. This can include different
functional units such as branch, integer, multiply and floating-point units, but also redundant units
of the same type such as four integer units. Examples of superscalar techniques in common processors are:
\begin{itemize}
\item Hyperthreading (duplicates register space for the thread context)
\item Multiple execution units (often combined with out of order execution)
\end{itemize}
\item \textbf{Pipelining:} Increases the \textbf{frequency} of the processor, but also the \textbf{latency}: one instruction requires
multiple clock cycles. Instruction handling is split into several stages where each stage can handle one instruction at a time. An
instruction passes through several stages before its computation is complete. Typical stages are Fetch, Decode, Dispatch, Execute,
Write-back, etc. The latency usually covers only the execution stage. Example: in the first cycle an instruction is fetched by the
fetch stage, then in the next cycle a new instruction is fetched by the fetch stage and the first instruction moves on to the decode
stage where it is decoded. \textbf{Problems} of pipelining are called \textbf{hazards} and are caused by the increased latency of
instruction results. Such a hazard often appears if a second instruction depends on the result of a first instruction. It can be solved
with different techniques such as out-of-order execution. Another problem appears in combination with branch prediction. If a branch is
wrongly predicted, then the pipeline becomes invalid and must be restarted, which leads to a penalty of several cycles.
\item \textbf{Pipeline bypass:} A way to prevent pipeline hazards is a pipeline bypass. It allows a value that was computed by the
previous instruction to be read by the next instruction without the need to wait until the previous instruction has gone through the
completion and write-back stages.
\end{itemize}
\subsection{Instruction Set Architectures}
An instruction set architecture defines the programming interface of the processor. It includes aspects such as instructions, registers,
data types and memory access, as well as interrupt handling and exception handling. The instruction set architecture also defines how
external I/O is done. Common instruction set architectures are:
\begin{itemize}
\item \textbf{RISC} Reduced Instruction Set Computer: Refers to a load-store architecture with fixed width instructions.
\item \textbf{CISC} Complex Instruction Set Computers: Architecture which allows complex memory access, supports complex instructions
and variable width instructions.
\item \textbf{VLIW} Very Long Instruction Word: Architectures which combine multiple instructions for parallel execution. In particular
instructions can be explicitly specified to be executed in parallel.
\end{itemize}
There are two very important differences between CISC and RISC: one is the way memory access works, the second is the instruction width.
RISC uses a load-store architecture in which instructions that operate on memory are only used to load data into registers or store
data in memory, while all computation happens on registers. CISC on the other hand also allows operations to modify data directly inside
the memory. The second big difference is the width of instructions. RISC has a fixed instruction width (such as 4 bytes) whereas CISC
has a variable instruction width.
It is also important to note that pipelining and superscalar architectures can be problematic on CISC processors due to the complexity of
CISC instructions. Therefore CISC processors have introduced a concept called micro-operations, where instructions are first converted
into one or more micro-operations and are then dispatched and executed by the processor. This fact however does not turn a CISC processor
into a RISC processor; it is just an implementation detail of the microarchitecture of CISC processors.
\subsubsection*{Instruction level parallelism}
Instruction level parallelism (ILP) refers to the fact that many instructions are independent of each other and can be
executed simultaneously. It is used as a measure for the degree of parallelism in a given program. There are different techniques to
exploit this inherent parallelism in programs.
\begin{itemize}
\item In \textbf{Software} ILP can be exploited through a rescheduling of the instructions by the compiler.
\item In \textbf{Hardware} the ILP can be exploited by many techniques of the microarchitecture such as \textit{instruction pipelining},
\textit{superscalar execution}, \textit{out-of-order execution}, \textit{register renaming} and \textit{branch prediction}.
\end{itemize}
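The following listing is a small invented example (not taken from the lecture material) that illustrates instruction level parallelism on
the source level: the first two statements are independent of each other and could be issued in the same cycle by a superscalar processor,
while the third statement has a true (RAW) dependency on both and must wait for their results.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
int ilpExample(int x, int y, int u, int v) {
  int a = x * y; // independent of b
  int b = u + v; // independent of a, may execute in parallel with a
  return a - b;  // true (RAW) dependency on both a and b
}
\end{lstlisting}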
\subsection{SIMD Instructions}
Another concept commonly found in processors is support for SIMD instructions.
\begin{quote}
\textit{``SIMD instructions provide a form of vectorization where a large machine word is viewed as a vector of subwords and the same
operation is performed on all subwords in parallel''} \cite{simd}.
\end{quote}
For example a SIMD instruction could operate on vectors with a size of 128 bit while the data elements are 16 bit wide (short integers).
A vector of such a SIMD instruction would therefore contain eight data elements. If the processor supports a SIMD \lstinline{add}
instruction, two vectors of eight elements can be added into a vector of eight elements with a single instruction. When the SIMD
instruction is executed, the processor effectively performs eight add operations simultaneously. In that regard SIMD instructions are
similar to superscalar processors, but have a much higher degree of parallelism.
The SIMD concept is not tied to a particular ISA but can be added to RISC, CISC or even VLIW architectures. On the microarchitectural
level the processor has special SIMD registers and SIMD execution units to support the different SIMD instructions.
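As an illustration, the following listing is a scalar model (written in plain Java, since no real SIMD instruction is shown here) of a
single SIMD add on two vectors of eight 16 bit subwords; a real SIMD unit would perform all eight additions with one instruction, here
they are written out as a loop for clarity.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
// scalar model of one 128 bit SIMD add over eight 16 bit lanes
short[] simdAdd(short[] a, short[] b) {
  short[] r = new short[8];
  for (int i = 0; i < 8; i++) {
    r[i] = (short) (a[i] + b[i]); // one lane of the vector addition
  }
  return r;
}
\end{lstlisting}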
\section{Optimizations}
\label{sec:optimization}
This section describes high-level optimizations that are performed by the compiler on the intermediate representation of a program. At
this point the code is still machine independent and instruction selection has not yet been performed.
\subsection{Static Single Assignment}
Static single assignment (SSA) is a program form that is required to enable most other optimizations: each variable has only
a single value assignment. Additionally every variable must be defined before it is used. Any code can be transformed into SSA form by
adding indexes to variables. For example
\parbox{10cm}{
\begin{flalign*}
i &= 5 + 7 &\\
z &= 8\\
i &= i + z
\end{flalign*}
}
\noindent becomes:
\parbox{10cm}{
\begin{flalign*}
i_1 &= 5 + 7 &\\
z &= 8\\
i_2 &= i_1 + z
\end{flalign*}
}
\noindent When dealing with branches, $\varphi$-nodes are added to merge variables from separate branches:
\begin{lstlisting}[xleftmargin=.5cm,numbers=none,mathescape=true,columns=flexible,basicstyle=\ttfamily]
if(a)
$i_1$ = 10
else
$i_2$ = 20
$i_3$ = $\varphi(i_1,i_2)$
\end{lstlisting}
\subsection{Graph structures and memory dependencies}
\label{sec:graphs}
Graph data structures are also required for most optimizations and contain information about data dependencies between instructions; they
are often called \textbf{dependency graphs}. In general graphs can be separated into two categories: acyclic graphs and
cyclic graphs. \textbf{Acyclic graphs} are often used to model code
inside basic blocks\footnote{
The concept of basic blocks is introduced in the compiler construction lecture and refers to a block of code with only one entry and
one exit point. That means in particular that basic blocks cannot contain loops or branches.
}
or extended basic blocks. On the other hand \textbf{cyclic graphs} are usually used to model loops or recursion and usually extend over a
whole procedure and not just a basic block. In such graphs instructions (or operations) are represented as vertices and data dependencies
as edges. Depending on where the graph is used, vertices and edges can be annotated with additional information such as delays between
instructions.
The data dependencies that form the edges of a graph can be categorized into three kinds of dependencies:
\begin{itemize}
\item true dependency or \textbf{read after write} (RAW)
\parbox{5cm}{
\begin{flalign*}
A &= B + C & \text{write A}\\
D &= A + 2 & \text{read A}
\end{flalign*}
}
\item anti dependency or \textbf{write after read} (WAR)
\parbox{5cm}{
\begin{flalign*}
A &= B + C & \text{read B}\\
B &= D - 2 & \text{write B}
\end{flalign*}
}
\item output dependency or \textbf{write after write} (WAW)
\parbox{5cm}{
\begin{flalign*}
A &= B + C & \text{write A}\\
D &= A + 2 \\
A &= E + F & \text{write A}
\end{flalign*}
}
\end{itemize}
Many of the optimizations that are introduced in this and the later chapters require a detailed knowledge of the data dependencies between
instructions.
\subsection{Machine independent optimizations}
Machine independent optimizations are performed on the intermediate representation and are independent of the machine (i.e.\ the instruction
set architecture) being used.
\begin{itemize}
\item Scalar optimizations:
\begin{itemize}
\item \textbf{Constant Evaluation:} resolves expressions of multiple constants such as 5+7, which is resolved to 12
\item \textbf{Constant Propagation:} propagates the constant to places in the code where variables refer to constants
\item \textbf{Copy Propagation:} replaces the use of a variable with the use of another variable of equal value
\item \textbf{Common subexpression elimination:} Opposite of \textbf{rematerialization}
\item \textbf{Scalar expansion:} Promote a scalar to a vector/array inside a loop (used for SIMD and VLIW)
\item \textbf{Strength reduction:} replaces expensive instructions by cheaper ones (e.g. replace $5*a$ with $(a<<2)+a$, since shift and add are cheaper than multiplication). Another meaning of strength reduction concerns address optimizations in connection with induction variables.
\end{itemize}
\item Progam optimizations:
\begin{itemize}
\item Dead code elimination (unreachable vs. useless code elimination)
\item Function inlining (Speed) vs. Procedural abstraction (Code size)
\item Tail call elimination: Transforms recursive calls into iterative loops (very common in logic programming)
\item Peephole optimization (very small optimizations over 2--3 instructions, hard to analyze)
\end{itemize}
\item Induction variable elimination: an optimization that tries to identify and remove the induction variables inside loops.
\begin{itemize}
\item The improved loop uses pointer arithmetic instead of an index variable
\item Loop-invariant code motion moves code that does not depend on the loop iteration out of the loop body
\item Use information about induction variable for optimization
\end{itemize}
\item Loop optimizations:
\begin{itemize}
\item Loop interchange (swaps the inner and the outer loop; the inverse transformation is again loop interchange, see the sketch after this list)
\item Loop fusion vs. Loop fission (Splitting) (or distribution)
\begin{itemize}
\item Loop peeling is a special case of loop splitting that peels off one or more iterations from the beginning or the end of the loop and performs these operations separately before or after the loop.
\end{itemize}
\item Loop collapsing (collapses inner and outer loop) vs. Strip mining
\item Strip mining transforms a single loop into a doubly nested loop where the outer loop is incremented by a certain block size
and the inner loop is incremented by one:
\begin{lstlisting}
for (int j = 0; j < N; j += 32) {
  for (int i = j; i < Math.min(j + 32, N); i++) {
    //loop body
  }
}
\end{lstlisting}
\item Loop vectorization (SIMD, VLIW): The compiler analyzes the dependence graph and if it contains no cycles it can make use of
special vector instructions of the target processor.
\item Loop concurrentization: Loops are split into several partitions and each partition is computed on a different processor
concurrently.
\end{itemize}
\item Reorder transformation (like code scheduling but higher level, e.g. compiler frontend not backend). Uses sophisticated dependency
analysis.
\end{itemize}
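The following listing sketches the loop interchange transformation mentioned in the list above; the example is invented and assumes a
row-major two-dimensional array, where iterating with the row index in the inner loop has poor cache locality and swapping the loops
restores consecutive memory accesses.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
void interchange(int[][] a, int n) {
  // before interchange the inner loop varied the row index i,
  // which produces strided accesses in a row-major array:
  //   for (int j = 0; j < n; j++)
  //     for (int i = 0; i < n; i++)
  //       a[i][j] = 0;
  // after interchange the inner loop varies the column index j,
  // giving consecutive accesses:
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      a[i][j] = 0;
    }
  }
}
\end{lstlisting}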
\section{Instruction selection}
Instruction selection is a process which transforms the intermediate representation of the program into object code, by
selecting a machine instruction for each computation. The resulting object code after instruction selection is therefore machine
dependent. Instructions may have associated costs and the goal of instruction selection is to find a sequence of instructions,
which have minimal costs.
There exist several implementations for instruction selection, which are presented in the following sections.
\subsection{iBurg}
iBurg is an improved version of \textbf{Burg}, which was not discussed in the lecture. Its main advantage is that it is much
simpler to implement (about 700 lines of code). Burg computes costs at generation time (compile-compile time). In theory iBurg
could support dynamic cost computation at instruction selection time (compile time), but it does not do so in practice for
backwards compatibility with Burg. Instructions are selected through tree pattern matching and dynamic programming. The goal of
tree pattern matching is to find a minimal cost cover for the data flow tree that represents the operations of the intermediate
language.
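To make the idea of tree pattern matching with dynamic programming more concrete, the following listing is a minimal sketch in Java; the
node kinds, rules and costs are invented for illustration and do not correspond to iBurg's actual specification language or to its
generated matchers. Each node of the data flow tree is labeled bottom-up with the cheapest rule that can cover it.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
// Invented node kinds ("REG", "CONST", "ADD") and rules; real iBurg
// grammars and matchers look different.
class Node {
  String op;
  Node left, right;
  int cost;      // minimal cost to compute this node into a register
  String rule;   // name of the selected rule

  Node(String op, Node left, Node right) {
    this.op = op; this.left = left; this.right = right;
  }

  void label() {                      // dynamic programming, bottom-up
    if (left != null) left.label();
    if (right != null) right.label();
    if (op.equals("REG")) {
      cost = 0; rule = "reg";
    } else if (op.equals("CONST")) {
      cost = 1; rule = "load-immediate";
    } else if (op.equals("ADD") && right.op.equals("CONST")) {
      cost = left.cost + 1;           // constant folded into an add-immediate
      rule = "add-immediate";
    } else if (op.equals("ADD")) {
      cost = left.cost + right.cost + 1;
      rule = "add";
    }
  }
}
\end{lstlisting}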
\subsection{BEG}
BEG stands for Back End Generator. It is not only used for instruction selection but also includes register allocation. The important
aspects of BEG are:
\begin{itemize}
\item It is more advanced than iBurg
\item Has a more powerful rule engine and supports \textbf{dynamic costs} and \textbf{conditions} for instruction selection.
\item Can do \textbf{register allocation}, but only locally on basic block level
\item Has two types of register allocation (on the fly, general register allocation)
\end{itemize}
\subsection{Dag-driven instruction selection with PBQP}
Another solution for instruction selection is the dag-driven instruction selection using SSA-Graphs and PBQP. Instruction
selection is modeled as a partitioned boolean quadratic problem, which is
an NP-Complete problem. PBQP tries to find a cost minimal assignment $h$ of variables $X$ to their respective domains $D$:
\[ h: X \rightarrow D\]
At the core of PBQP two functions are used, a local cost function $c$ and a related cost function $C$, the sum $f$ of both forms
the cost of an assignment of variables to their respective domains:
\[ f = \sum c_i + \sum{C_{ij}} \]
The instruction selection with PBQP works like this. For each node $u$ in the input DAG there exists a variable $x_u$ which is
mapped to an operation out of a list of possible operations for this node. The possible operations are defined by a normalized
graph grammar which consists of base rules and chain rules. The local cost function is defined as a
cost vector $c_u$ that defines the cost of assigning each operation to $x_u$. Additionally the related cost function is modelled as
a matrix $C_{uv}$. It is used to \textit{``ensure consistency among base rules and to account for the costs of chain rules''}
\cite{pbqp-instruction-selection}. The values of $c_u$ and $C_{uv}$ are either \textbf{zero} for matching base rules,
\textbf{infinite} for incompatible
base rules or a value which is the \textbf{sum of transitive chain rule costs} in order to make two non-matching base rules
compatible. In summary: \textit{``A solution of PBQP determines which base rules and chain rules are to be selected''}
\cite{pbqp-instruction-selection}.
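As a small invented example (not taken from the paper), assume a node $u$ can be covered by two base rules $r_1$ and $r_2$ and an adjacent
node $v$ by rules $r_3$ and $r_4$. The local cost vector and the related cost matrix could then look as follows, where the entry in row
$i$ and column $j$ of $C_{uv}$ is the cost of combining the $i$-th rule choice for $u$ with the $j$-th rule choice for $v$:
\begin{align*}
c_u = \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \qquad
C_{uv} = \begin{pmatrix} 0 & \infty \\ 2 & 0 \end{pmatrix}
\end{align*}
A zero entry means the two base rules match directly, $\infty$ forbids an incompatible combination, and the finite entry $2$ stands for
the accumulated cost of the chain rules needed to make the two base rules compatible.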
The PBQP based instruction selection also allows instruction selection for complex patterns such as ``div-mod'' or ``auto-increment'',
however this can lead to cycles in the graph.
\begin{itemize}
\item SSA graphs model the data flow through registers, but not the dataflow through memory. That is, SSA graphs do not reflect
memory dependencies. This can lead to cycles which must be broken up by adding potential data dependencies.
\item \textbf{Complex patterns} for instructions with multiple outputs such as ``div-mod'' or ``auto-increment for pointers'' can
cause cycles in the graph. This can happen if \textit{``a set of operations in the SSA graph is matched by a pattern for which
there exists a path in the SSA graph that exits and re-enters the complex pattern within the basic block''}
\cite{pbqp-instruction-selection}.
\end{itemize}
Complex patterns are matched by enumerating the set of all viable instances of a complex pattern. In the resulting set two instances $p$
and $q$ within a basic block $b$ may have a dependency $p \prec_b q$, which can lead to the cycles mentioned above. Such cycles are
broken up by adding additional constraints in the form of \textbf{cost matrices}, which \textit{``constrain the solution space such that
no cyclic data dependencies can be constructed in any valid solution''} \cite{pbqp-instruction-selection}.\\
An important difference between the different instruction selectors is when costs are computed. Burg and iBurg do it at compile-compile
time, whereas BEG and PBQP do it at compile time.
\section{Register allocation}
\label{sec:register-allocation}
There are three important algorithms: Chaitin's and Briggs', which use a graph coloring approach, and Chow-Hennessy's, which uses a
priority function.
\subsection{Chaitin}
Chaitin's algorithm builds an interference graph and then tries to simplify it. \textbf{Simplification} happens by removing nodes
from the graph that have a \textit{degree} which is less than the number of available registers. The nodes are placed on a stack
for later coloring. If the graph reaches a state where no such nodes can be removed, then Chaitin spills one node. The node to
spill is selected through a cost function that assigns each node a weight (cost divided by degree of the node) and then chooses
the one where the cost for spilling is as low as possible. If a node is spilled, then the graph is rebuilt without that node and the
steps are repeated until the graph is empty.
Otherwise, when all nodes have been removed from the graph, the coloring phase starts. In the \textbf{coloring phase} each node is
removed from the stack (in the reverse order of how they were put on the stack) and added back to the graph, then it is assigned a
color that is different from all its neighbours. This approach prevents Chaitin from successfully coloring certain graph patterns
such as the diamond pattern (a cycle of four nodes) with the minimum of 2 registers. Instead Chaitin needs one additional register.
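The following listing is a compact sketch of the simplify loop described above (the data structures are simplified and the spill handling
is only indicated by a comment); it assumes an interference graph given as adjacency sets and a precomputed spill cost per node.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
import java.util.*;

class ChaitinSimplify {
  // Sketch of Chaitin's simplify phase for k available registers: nodes with
  // degree < k are pushed on a stack for later coloring; if no such node
  // exists, the node with the lowest cost/degree ratio is chosen for spilling.
  static Deque<Integer> simplify(Map<Integer, Set<Integer>> graph,
                                 Map<Integer, Double> spillCost, int k) {
    Deque<Integer> stack = new ArrayDeque<>();
    while (!graph.isEmpty()) {
      Integer removable = null;
      for (Integer n : graph.keySet())
        if (graph.get(n).size() < k) { removable = n; break; }
      if (removable == null) {
        // no trivially colorable node left: pick the cheapest node to spill
        removable = Collections.min(graph.keySet(), Comparator.comparingDouble(
            (Integer n) -> spillCost.get(n) / Math.max(1, graph.get(n).size())));
        // in the real algorithm spill code is inserted here and the graph rebuilt
      } else {
        stack.push(removable); // colored later in reverse order
      }
      for (Integer m : graph.get(removable)) graph.get(m).remove(removable);
      graph.remove(removable);
    }
    return stack;
  }
}
\end{lstlisting}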
\subsection{Briggs}
Briggs' algorithm works just like Chaitin's, but it performs the \textbf{spilling} of nodes \textbf{after} the
\textbf{coloring} phase whereas Chaitin spills when it removes nodes from the graph. This has the effect that Briggs can color
diamond patterns in the graph with a minimum of 2 registers instead of 3 like Chaitin does. The difference between Chaitin and
Briggs can be seen in Figure \ref{chaitin-briggs}. When the algorithm starts the \textbf{coloring phase} it reinserts nodes from
the stack back into the graph and gives them a color different from all neighbours. The difference becomes obvious when a node is
inserted back into the graph that has $k$ or more neighbours. Chaitin would have spilled such a node in the simplification phase.
With Briggs' algorithm on the other hand there are \textbf{two possible cases}:
\begin{itemize}
\item It might be the case that some of its neighbours already have the same color and that there is still a color
available to assign to the node. In this case the node is colored and the algorithm proceeds to the next node.
\item On the other hand if no color is available, then Briggs leaves it uncolored and proceeds to the next node. After
all nodes have been reinserted into the graph the uncolored nodes are spilled and the procedure is repeated.
\end{itemize}
\begin{figure}[h!b]
\centering
\begin{tikzpicture}[auto,
box/.style={inner sep=0mm,minimum height=2cm,minimum width=3cm,draw},thick,
edge/.style={->,>=stealth,line width=2pt},
]
\node[box] (build) at (0,0) {Build graph};
\node[box] (simplify) at (5,0) {Simplify};
\node[box] (color) at (10,0) {Color};
\node[box] (spill) at (5,3) {Insert Spill Code};
%start
\draw [edge,->] (-3,0) -- (build.west);
%loop
\draw [edge,->] (build.east) -- (simplify.west);
\draw [edge,->] (simplify.east) -- (color.west);
\draw [edge,->] (simplify) -- node [swap] {Chaitin's method} (spill);
\draw [edge,->] (color) -- (10,3) -- node [swap] {Briggs' method} (spill.east);
\draw [edge,->] (spill.west) -- (0,3) -- (build);
%end
\draw [edge,->] (color) -- (13,0);
\end{tikzpicture}
\caption{Register allocation phases of Briggs and Chaitin \cite{briggs}.}
\label{chaitin-briggs}
\end{figure}
\subsection{Chow-Hennessy}
Chow-Hennessy assumes that all variables reside in memory and are loaded into registers as required. This is the reverse concept of
Chaitin and Briggs. A priority function is used to calculate the saving when a variable is kept in a register
instead of in memory. The variables which have the greatest savings are then selected to be stored in registers. Chow-Hennessy's
register allocator works on procedure level.
If no suitable register is found, then live range splitting is applied.\\
Steps in the Chow-Hennessy algorithm:
\begin{enumerate}
\item Separate the \textbf{unconstrained live ranges} and handle them in the last step. Unconstrained live ranges are those
which have fewer neighbours than available registers.
\item Repeat steps a) to c), each time assigning a color to a live range, until all constrained live ranges have been assigned a color or
until there is no register left that can be assigned to any live range in any basic block.
\begin{enumerate}
\item Compute the priority function $P(lr)$ for each constrained live range $lr$ if it has not been computed yet. If $P(lr)$ is negative, or if
$lr$ is uncolorable, mark $lr$ as a noncandidate and leave it unassigned. Delete $lr$ from the interference graph.
\item Find the live range with the highest priority function, $lr^*$, and assign a color to it. The color must not be in the
\textbf{forbidden} set for $lr^*$. The forbidden set lists registers that are not allowed as target registers for the given live
range. For each live range that $lr^*$ interferes with, update the \textbf{forbidden} information.
\item Check each live range interfering with $lr^*$ to see if it needs to be split. A live range needs to be split if all of its target
registers are in its forbidden set.
\end{enumerate}
\item Assign colors to the unconstrained live ranges, each time choosing a color not belonging to their forbidden set.
\end{enumerate}
\paragraph{Priority function:} The priority function $P(lr)$ assigns a priority to each live range $lr$ to determine if it should be
allocated to a register. $P(lr)$ is computed as the sum of execution time savings divided by the number of units in each live range:
\begin{align}
S(lr) &= \sum_{i\in lr}{s_i \times w_i}\\
P(lr) &= \frac{S(lr)}{N}
\end{align}
with
\begin{itemize}
\item $s_i$ = Weighted sum of load costs, store costs and move costs per unit. A unit is normally a basic block.
\item $w_i$ = Estimated execution frequency
\item $N$ = Number of live units in the live range
\end{itemize}
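As a small made-up numerical example, assume a live range spans $N = 4$ units with per-unit savings $s_i = 3, 1, 2, 2$ and estimated
execution frequencies $w_i = 10, 1, 5, 5$. Then
\begin{align*}
S(lr) &= 3\cdot10 + 1\cdot1 + 2\cdot5 + 2\cdot5 = 51,\\
P(lr) &= \frac{51}{4} = 12.75,
\end{align*}
so this live range would be preferred over any live range with a lower priority value.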
\paragraph{Live range splitting:} If a symbolic register has a very long live range it can be split into two parts; this way each smaller
live range can be spilled separately if necessary. Live range splitting can be easily added to Chaitin and Briggs but was not present in
the original algorithms.
\subsection{Differences and related concepts}
Besides knowing the algorithms, it is important to know some of their differences. The following differences between the algorithms have
been emphasised in the lecture. Additional differences can be found in the paper by Chow and Hennessy, which contains a summary of the
differences between Chaitin's and Chow-Hennessy's algorithms (see \cite{chowhennesy}, Section 6).
\begin{itemize}
\item Both Chaitin's and Briggs' algorithms need multiple iterations whereas Chow-Hennessy finishes in one iteration. The reason is that
for Chaitin and Briggs the insertion of spill code may block additional registers, thus requiring more live ranges to be spilled
(see Figure \ref{chaitin-briggs}).
\item Chaitin and Briggs do the live range analysis on instruction level whereas Chow-Hennessy performs live range analysis
on basic block level. This is because Chaitin and Briggs do register allocation after instruction selection, but Chow-Hennessy's
algorithm works on the intermediate code level.
\item Chaitin and Briggs do register allocation for all registers, whereas Chow-Hennessy only does register allocation outside of basic
blocks.
\end{itemize}
The following paragraph summarizes some concepts that are relevant in connection with register allocation.
\paragraph{Calling conventions:} Calling conventions are conventions about how registers are used and who is responsible for saving and
restoring them. There are \textbf{caller-saves} and \textbf{callee-saves} registers. The conventions are a result of a multi-step
process. First the architecture has an impact on the calling conventions, then the operating system and lastly the programming
language being used. Calling conventions are part of the ABI and if not taken care of they may cause incompatibilities between program
binaries, for example when libraries are used or when programs written in different languages interact.
\paragraph{Pre-coloring:} Pre-coloring is used to assign correct colors to argument registers to avoid unnecessary copy instructions.
\paragraph{Rematerialization:} Some optimizations in Section\, \ref{sec:optimization}\, such as \textit{common subexpression
elimination} optimize a program by removing redundant computation. The disadvantage of such an optimization can be that more symbolic
registers are used or that the live range of a symbolic register is increased. However sometimes it can be cheaper to recompute a value
than to store it in a register. This is often the case if a value can be recomputed with a single instruction. Rematerialization
recomputes a value from another value such that instead of two values only one value needs to remain in a register and the other can be
recomputed when needed. Rematerialization is often used in address calculation such as adding a constant value to an address. For example
in \lstinline|*p = *q + 2;|, if $p$ is used at different locations in a function it can either remain in a register or be recomputed
from $q$ each time it is needed. Rematerialization is commonly done by register allocators when it is easier to recompute a value than
to keep it in its own register or to reload it from memory.
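As an invented, source-level illustration of the idea (using an array index instead of a pointer, since the listings in this summary are
Java): the index $q + 2$ is cheap to recompute, so instead of keeping a separate value $p$ live in a register across a long region the
allocator can rematerialize it at each use.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
int rematExample(int[] a, int q) {
  // instead of computing int p = q + 2 once and keeping it in a register,
  // the allocator may recompute q + 2 at each use:
  a[q + 2] = a[q] + 2;     // first use, p rematerialized as q + 2
  // ... code with high register pressure ...
  return a[q + 2];         // second use, rematerialized again
}
\end{lstlisting}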
\section{Coalescing}
Coalescing is a form of copy elimination that finds registers that are copies of each other and removes the copy instruction, thus
putting the values into the same register. Coalescing may affect the colorability of an interference graph (both negative and positive
effects are possible).
\begin{itemize}
\item Aggressive coalescing (used by Chaitin): Removes all possible copy instructions, which may result in increased spilling.
\item Conservative coalescing: Performs coalescing only when graph colorability is not affected. Was intended to be an improvement but
later analysis showed that the algorithm actually performed worse than aggressive coalescing.
\item Iterative coalescing: Eliminates more copies by interleaving conservative coalescing with simplification phases (like Briggs).
Also has worse results than aggressive coalescing.
\item Optimistic coalescing: Aggressive coalescing, but the algorithm memorizes what has been coalesced. If coloring is not possible
and nodes need to be spilled, then the coalescing is reversed by live range splitting and then only parts of the nodes are spilled
if possible. \textbf{Best results}.
\end{itemize}
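A tiny invented example of the kind of copy that coalescing removes: \lstinline{b} is a plain copy of \lstinline{a} and the two never hold
different live values at the same time, so both can be assigned the same register and the copy instruction disappears.
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
int coalesceExample(int a) {
  int b = a;     // copy instruction, candidate for coalescing
  int c = b * 2; // after coalescing this reads a directly: c = a * 2
  return c;
}
\end{lstlisting}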
\section{Optimal register allocation using integer linear programming}
The performance overhead of spill code is minimized by using integer linear programming to achieve near optimal results. Since the problem
is NP-complete an optimal solution will be intractable for very complex problems. However the Optimal Register Allocator (ORA) solver still
produces very good results that are very often within 1\% of the optimal solution. In order to define what optimal means we need
to know which optimizations are being done (or not done) by the register allocator.\\
\noindent \textbf{What does the register allocator do and not do?}
\begin{enumerate}
\item The register allocator does not reorder instructions.
\item The register allocator adds only spill load instructions, spill store instructions, and rematerialization instructions.
\item The register allocator removes only copy instructions and instructions defining symbolic registers that are rematerialized.
\item All control-flow paths through a function can potentially execute.
\item At a given location, the cost of a spill instruction is a fixed number of cycles (based on the instruction's type) times the
location's estimated execution count.
\item A basic block's estimated execution count equals the sum of the counts exiting the block, i.e. estimated execution counts
satisfy Kirchhoff's current law over the control flow graph.
\end{enumerate}
The ORA implementation is composed of three modules. The analysis module analyzes a function's control flow graph and identifies
points which require a decision about register allocation actions. For each decision a boolean decision variable is introduced that
records if the decision is made (1) or not (0) and also what action is taken if the variable is set to 1. These decision variables are
then built into a series of \textbf{inequalities} that model constraints on the decision variables. The solver module then solves these
inequalities and finally the rewrite module rewrites the code according to the results in the decision variables.
\subsection{ORA Modules}
Most of the paper focuses on the \textbf{analysis module} and explains the steps to derive (in)equalities from decision variables. The
steps in the analysis module are as follows:
\begin{itemize}
\item Use live range analysis on the function's control flow graph and build an \textbf{instruction graph} for each symbolic register.
The instruction graph contains solid and dashed lines for each symbolic register to show active and inactive live ranges,
respectively.
\item Apply transformations to create a \textbf{symbolic register graph}. Each transformation can potentially create one or more
conditions. A \textbf{condition} is an equality or inequality function containing decision variables to constrain the valid
assignments to the decision variables. The following gives an example of such a condition:
\begin{align}
x_{defA} &= x_{use-end1A} + x_{use-cont1A}
\end{align}
This condition ensures that if the symbolic register $A$ is allocated to a register at the given instruction, then its use must
either end at that point or it must continue. At this point decision variables are not yet tied to real registers; that happens
in the next step.
\item A copy of each symbolic register graph and its associated set of conditions is made for each \textbf{real register} that is
available for allocation. For example:
\begin{align}
x^1_{defA} + x^2_{defA} &= 1
\end{align}
This condition requires that the symbolic register $A$ is allocated to exactly one physical register, either register $1$ or
register $2$, but not to both or none.
\item Next several additional conditions are added to ensure that the solution is correct and optimal. For example conditions are
added such that not too many or too few symbolic registers are allocated to real registers, or to ensure optimal spill code
placement.
\item Another additional transformation is for rematerialization. If a symbolic register can be rematerialized (see Section
\ref{sec:register-allocation}), then the register allocator adds conditions to allow rematerialization instead of spilling
and loading a symbolic register.
\end{itemize}
When all conditions have been created they are passed to the solver module to find a valid solution for all decision variables.
\subsection{Results}
The experimental results show that the ORA performs significantly better than other register allocators and is able to remove many
copy instructions and to reduce the overhead of spill code. There exists a second paper that was written six years later and
discusses improvements to the original ORA implementation. The authors were able to reduce the complexity of the algorithm
from $O(n^3)$ to $O(n^{2.5})$. They also improved the speed by a factor of 1000 (10x CPU, 10x better solver, 10x improved model).
\section{Interprocedural register allocation}
Interprocedural register allocation (which is also often called global register allocation) performs register allocation across
procedures or compilation units and not just within a single function as is done in conventional global register allocation. This solves
the problem that different functions may assign the same global variable to different registers or the same register to different local
variables, which causes unnecessary copy instructions.
\subsection{Global Register allocation at link time (Annotation based)}
Global here means interprocedural across all compilation units. The register allocation is split into three parts:
\begin{itemize}
\item Compiler
\item Register allocator
\item Linker
\end{itemize}
The compiler does as much work as possible in order to maximize the benefits of separate compilation such that in the linking phase there
is not too much work required. The compiler inserts annotations into the object code which describe how variables are used. Annotations
provide information to the linker about how to replace variables and how to rewrite the code. The following six annotations are inserted
by the compiler. Each annotation is tied to a variable $v$. When the linker is run, it analyses the object code to identify if $v$ has
been assigned to a register. If $v$ is in a register then the corresponding action is applied, otherwise the annotation is ignored.
\begin{itemize}
\item \makebox[4.3cm]{REMOVE.v\hfill} Remove the instruction
\item \makebox[4.3cm]{STORE.v\hfill} Replace store by copy instruction
\item \makebox[4.3cm]{LOAD.v\hfill} Replace load by copy instruction
\item \makebox[4.3cm]{OP1.v, OP2.v, RESULT.v\hfill} Replace operands or result by register
\end{itemize}
The register allocator generates a call graph and uses the information provided by the compiler to assign variables to registers. It does
this by building groups of variables that are not active at the same time and then assigns a usage frequency to those groups. The same
happens for global variables. Finally the groups with the highest frequencies are allocated to registers. After variables have been
assigned to registers the linker can use this information to rewrite the code and to apply the actions that have been stored in the
annotations. While doing this it must take care to adjust addresses of instructions or data if a previous instruction is being removed.
Additional considerations concern initialization and recursion.
\subsubsection*{Initialization}
Special care must be taken to initialize global variables and procedure parameters. For global variables an INIT action is placed at the
beginning of the program and runs when the program is invoked. The INIT action contains instructions to copy the initial value of global
variables from memory to the respective register.
\subsubsection*{Recursion and indirect calls}
Recursion means that the same procedure may be called repeatedly before the first invocation has been completed. Therefore the program
needs to take care to save the necessary registers of local variables before the next invocation of the procedure. The question is who
should be responsible for saving the local variables of the procedure. It turns out that it is not enough if each recursive procedure is
responsible for saving its local variables, because this would not save the local variables of other procedures in the call chain. The
solution is to let the procedure that makes the recursive call save the local variables of all procedures in the call chain. The same
solution is used for indirect calls.
\subsection{Register allocation across procedure and module boundaries (Web)}
This has two important parts, the web algorithm itself and clustering (e.g. spill code motion).
\subsubsection*{Web algorithm:}
Performs several steps:
\begin{enumerate}
\item Reads in source files and produces intermediate representation files.
\item Creates program database.
\item Uses the program database and transforms the intermediate representation files into object files.
\item Linking of the object files.
\end{enumerate}
The web algorithm creates a table with L\_REF, P\_REF, and C\_REF entries from which a web interference graph is generated. This is then
used to apply a coloring on the graph that determines which register is being used for each global variable.
\subsubsection*{Clustering (Spill code motion):}
The idea is to move spill code for callee-saved registers inside a cluster from one node to
its dominating node. If the node is called more than once, then the overhead of the spill code is reduced. The same applies to caller-saved
registers, which are moved down to lower nodes in the same cluster.
\section{Instruction Scheduling}
Instruction scheduling is an optimization that deals with rearranging the instructions of a program to better exploit the inherent
parallelism of the instructions. The goal is to move instructions that depend on each other further apart and to insert independent
instructions in between. This prevents dependent instructions from stalling the processor pipeline and allows more instructions to execute
simultaneously. The simplest form of instruction scheduling is called list scheduling because from a list of possible instructions that
can be scheduled the one with the highest priority is selected based on some kind of heuristic. There exist two papers that deal with list
scheduling which both present different heuristics for the problem. The list scheduling algorithms presented in the papers have a worst
case complexity of $O(n^2)$, needed to build the dependency graph. However in practice there also exists a linear time algorithm.
The improved \textbf{linear time algorithm} works as follows. The algorithm goes through each instruction, and each time a register is
defined the algorithm saves which instruction defined the register. Each time a register is accessed the corresponding instruction can be
looked up and the algorithm adds a dependency to the graph. The only problem with this algorithm is that the initialization of the
corresponding data structure may be expensive. However this can be optimized when using virtual registers by storing from which
instruction to which instruction a basic block extends. Based on these limits it can easily be determined if an instruction is still
independent. Otherwise the data structures would need to be rebuilt for each basic block.
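The following listing is a sketch of the linear time idea under simplifying assumptions: only RAW dependencies are recorded, and every
instruction is encoded as an array whose first entry is the defined register and whose remaining entries are the used registers (this
encoding is invented for the example).
\begin{lstlisting}[xleftmargin=.5cm,numbers=none]
import java.util.*;

class DependencyGraphBuilder {
  // For every register we remember the index of the instruction that
  // defined it last; every use then adds a RAW edge from that defining
  // instruction to the current one.
  static List<int[]> buildRawEdges(int[][] instr, int numRegs) {
    int[] lastDef = new int[numRegs];
    Arrays.fill(lastDef, -1);                 // -1: no definition seen yet
    List<int[]> edges = new ArrayList<>();
    for (int i = 0; i < instr.length; i++) {
      for (int u = 1; u < instr[i].length; u++) {
        int def = lastDef[instr[i][u]];
        if (def >= 0) edges.add(new int[]{def, i});   // RAW dependency
      }
      lastDef[instr[i][0]] = i;               // record the new definition
    }
    return edges;
  }
}
\end{lstlisting}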
\subsection{Phase ordering problem}
A problem between instruction scheduling and register allocation is that both optimizations can have negative effects on the other one;
thus it is difficult to decide what should be done first -- register allocation or instruction scheduling. This problem is called the
\textit{phase ordering problem}. The list scheduling algorithms below attempt different solutions but still perform both optimizations
separately. A better solution is discussed in the next section, which presents algorithms that perform instruction scheduling and register
allocation together at the same time.
The impact of each optimization to the other is as follows:
\begin{itemize}
\item Register allocation serializes the code and removes opportunities for scheduling (less instruction level parallelism).
\item Instruction scheduling increases the length of the live ranges (register pressure), causing more interferences and thus requiring
the use of more registers or increased spilling.
\end{itemize}
\subsection{Instruction scheduling for the IBM RISC System/6000 Processor (IBM)}
The algorithm which is described in this paper performs the instruction scheduling twice, once before register allocation and once
afterwards. The first instruction scheduling has more freedom since register allocation has not yet been done, but register allocation
may insert spill code and delete move instructions, which makes a second run of instruction scheduling necessary. The algorithm itself works
as follows:
\begin{itemize}
\item It uses a dependency graph which is constructed for each basic block. The dependency graph has one node for each instruction and
adds a directed edge between two instructions if one needs to precede the other.
\item Edges are labeled with the delay between instructions.
\item After the dependency graph has been built, nodes are labeled bottom-up with the \textbf{maximum delay} from the node to the end of
the basic block (red digits in Figure \ref{dependency-graph}).
\item The algorithm keeps a \textbf{current time} counter, which is initially zero. It is increased by the execution time of each
scheduled instruction, which is usually one.
\item Nodes are labeled with an \textbf{earliest time} value which is calculated from the current time of the last scheduled predecessor
plus the delay of that instruction to its successor. For example the node L.B has a delay of one and is scheduled in step 2 where
the current time is increased to 2, so the ADD node would be labeled with an earliest time of 3 (2+1).
\end{itemize}
Figure \ref{dependency-graph} shows the dependency graph from the paper but adds the values of the maximum delay and a table that
shows how the current time is updated and which instruction is selected in each step.
\begin{figure}[h!]
\begin{tikzpicture}[node/.style={circle, draw, thick, minimum size=1.0cm,inner sep=0pt,font=\footnotesize,align=center},
item/.style={inner sep=0pt,font=\footnotesize,align=left},
edge/.style={->,>=stealth,line width=2pt},auto,
red/.style={draw=red},black/.style={draw=black},
legend/.style={draw=black}
]
\node[node] (la) at (1, 7) {L.B \\ \textcolor{red}{1}};
\node[node] (lc) at (3, 7) {L.C \\ \textcolor{red}{1}};
\node[node] (ld) at (4, 5) {L.D \\ \textcolor{red}{1}};
\node[node] (le) at (8, 3) {L.E \\ \textcolor{red}{4}};
\node[node] (add) at (2, 5) {ADD \\ \textcolor{red}{0}};
\node[node] (sub) at (3, 3) {SUB \\ \textcolor{red}{0}};
\node[node] (store) at (4, 1) {ST.A \\ \textcolor{red}{0}};
\node[node] (compare) at (8, 1) {CMP \\ \textcolor{red}{3}};
\node[node] (branch) at (6, 0) {BC \\ \textcolor{red}{0}};
\draw[edge] (la) -- node [swap] {1} (add);
\draw[edge] (lc) -- node {1} (add);
\draw[edge] (ld) -- node {1} (sub);
\draw[edge] (add) -- node [swap] {0} (sub);
\draw[edge] (sub) -- node {0} (store);
\draw[edge] (store) -- node {0} (branch);
\draw[edge] (le) -- node {1} (compare);
\draw[edge] (compare) -- node {3} (branch);
\node[legend] (foo) at (11,6) {
\begin{tikzpicture}
\node[item] (step1) at (0, 0) {1. Current time: $0\rightarrow1$;\quad select L.E\\
2. Current time: $1\rightarrow2$;\quad select L.B\\
3. Current time: $2\rightarrow3$;\quad select CMP\\
4. Current time: $3\rightarrow4$;\quad select L.C\\
5. Current time: $4\rightarrow5$;\quad select L.D\\
6. Current time: $5\rightarrow6$;\quad select ADD\\
7. Current time: $6\rightarrow7$;\quad select SUB\\
8. Current time: $7\rightarrow8$;\quad select ST.A\\
9. Current time: $8\rightarrow9$;\quad select BC};
\end{tikzpicture}
};
\node[legend] (foo) {
\begin{tikzpicture}
\node (redsquare) at (-2, -1) {\textcolor{red}{0}};
\node (redlegend) at (0, -1) {Maximum delay};
\end{tikzpicture}
};
\end{tikzpicture}
\caption{Dependency graph constructed by IBM instruction scheduling algorithm}
\label{dependency-graph}
\end{figure}
When deciding which node to select in each step, the algorithm applies a sequence of heuristics. Since there is no planning or lookahead
involved the algorithm is not guaranteed to find the best solution. The following seven heuristics are applied in order:
\begin{enumerate}
\item Initialize the set of all those instructions that have not yet been selected, and that have no predecessors in the dependency
graph (these are the ``legal'' ones).
\item Refine the subset to those instructions whose earliest time has arrived or, if none, those with the smallest earliest time.
\item If one or more instructions have been selected, and if the current subset contains one or more instructions of opposite type
(fixed/floating) from the last one selected, then refine the current subset to those of this opposite type.
\item Refine the subset to those of maximum total delay along the path of each instruction to the end of the basic block.
\item Refine the subset to those of minimum ``liveness weight''.
\item Refine the subset to those with greatest ``uncovering''.
\item Refine the subset to the unique instruction that came first in the original ordering.
\end{enumerate}
\subsection{Efficient Instruction Scheduling for a Pipelined Architecture (HP)}
The general idea is similar to the IBM algorithm. The algorithm works as follows:
\begin{enumerate}
\item Build a scheduling dag over the basic block
\item Put the roots of the dag into a candidate set
\item Select the first instruction to be scheduled based on the heuristics below
\item While the candidate set is not empty do the following:
\begin{enumerate}
\item Based on the last instruction scheduled and the heuristics below, select and emit the next instruction to be scheduled
\item Delete the instruction from the dag and add any newly exposed candidates to the candidate set
\end{enumerate}
\end{enumerate}
The heuristics use the following criteria to determine the priority with which a candidate should be scheduled (the algorithm and these
criteria are sketched together after the list):
\begin{enumerate}
\item Whether an instruction interlocks with any of its immediate successors in the dag.
\item The number of immediate successors of the instruction.
\item The length of the longest path from the instruction to the leaves of the dag.
\end{enumerate}
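A minimal Python sketch of how these pieces could fit together is shown below; it is not the paper's code, and the node attributes
(\lstinline|interlocks_successor|, \lstinline|longest_path|) and the \lstinline|interlocks_with| helper are assumptions made for the example.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
def schedule_block(dag):
    # Minimal list-scheduling sketch: `dag` maps every instruction
    # (including the leaves) to the list of its immediate successors.
    preds = {n: 0 for n in dag}
    for succs in dag.values():
        for s in succs:
            preds[s] += 1

    def priority(n):
        # Higher is better: interlocking instructions first, then the
        # number of immediate successors, then the longest path to a leaf.
        return (int(n.interlocks_successor), len(dag[n]), n.longest_path)

    candidates = [n for n in dag if preds[n] == 0]   # roots of the dag
    emitted, last = [], None
    while candidates:
        # Prefer a candidate that does not interlock with the instruction
        # emitted last; fall back to the full candidate set otherwise.
        pool = [n for n in candidates
                if last is None or not last.interlocks_with(n)] or candidates
        best = max(pool, key=priority)
        candidates.remove(best)
        emitted.append(best)
        last = best
        for s in dag[best]:                          # expose new candidates
            preds[s] -= 1
            if preds[s] == 0:
                candidates.append(s)
    return emitted
\end{lstlisting}
In this sketch, the priority tuple mirrors the three criteria above, and preferring a candidate that does not interlock with the last
emitted instruction is how the last instruction scheduled enters the decision.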
The instruction scheduling is performed after register allocation. The paper mentions register allocation but claims that
``serializing definitions does not unduly restrict [the] code''.
\section{Register allocation \& instruction scheduling}
This section introduces three methods which combine instruction scheduling and register allocation in order to solve the phase
ordering problem. The first paper presents two algorithms: Integrated Prepass Scheduling (IPS) and dag-driven register allocation.
The second paper introduces a method named Register Allocation with Scheduling Estimate (RASE).
\subsection{Integrated Prepass Scheduling}
The main issue of the phase ordering problem is that register allocation reduces opportunities for scheduling, while instruction scheduling
increases the length of live ranges, making register allocation more difficult. The IPS algorithm addresses this problem with two strategies,
switching to whichever one is more appropriate in the current situation. The two strategies are:
\begin{itemize}
\item CSP (Code scheduling for pipelined processors) reduces pipeline delays but can increase the lifetime of registers. It is used if
enough free registers are available.
\item CSR (Code scheduling to minimize register usage) is used when the number of free registers is low, to keep register usage under
      control. It tries to ``find the next instruction which will not increase the number of live register or if possible, decrease that number''.
\end{itemize}
IPS uses a simple heuristic to switch between the two strategies: it keeps track of the number of available registers in a variable named
AVLREG. If the number of free registers drops below a certain threshold, it switches to CSR; when AVLREG has risen above the
threshold again, it switches back to CSP.
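A minimal sketch of this switching logic is shown below; the threshold value, the \lstinline|csp_pick|/\lstinline|csr_pick| helpers and
the per-instruction \lstinline|frees|/\lstinline|defines| counts are assumptions, not taken from the paper.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
THRESHOLD = 4   # assumed switch-over point; the actual value is a tuning knob

def ips_schedule(candidates, avlreg, csp_pick, csr_pick):
    # Minimal sketch of the IPS switching logic: schedule with CSP while
    # enough registers are free, otherwise fall back to CSR.
    scheduled = []
    while candidates:
        if avlreg > THRESHOLD:
            instr = csp_pick(candidates)   # CSP: minimize pipeline delays
        else:
            instr = csr_pick(candidates)   # CSR: keep register pressure down
        candidates.remove(instr)
        scheduled.append(instr)
        # Registers freed by last uses of operands minus registers newly
        # defined by this instruction (attribute names are assumptions).
        avlreg += instr.frees - instr.defines
    return scheduled
\end{lstlisting}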
\subsection{Dag-driven register allocation}
Dag-driven register allocation is a register allocation technique intended for use with postpass scheduling; it allocates registers in a
way that keeps the dependency graph flat and wide. It also does not insert any additional spill instructions into the code. The algorithm
introduces two measures for the graph, the width and the height:
\begin{itemize}
\item The \textbf{width} is defined as the maximum number of mutually independent nodes. Wider graphs indicate higher parallelism.
\item The \textbf{height} is the length of the longest path. If the graph is too high then code scheduling becomes less efficient.
\end{itemize}
The goal of the dag-driven register allocator is to balance the dag. To do that, it \textbf{minimizes the height} of the graph and
\textbf{limits the width} of the graph to a number that is less than or equal to the number of available registers.
The dag-driven register allocator uses two techniques to achieve these goals:
\begin{itemize}
\item First, it makes use of free WAR dependencies. That is, it exploits redundant dependencies when allocating registers. Such
      redundant dependencies can occur, for example, if a RAW dependency already exists between two instructions. In the listing below,
      assume registers \lstinline|R2| and \lstinline|R4| are available for reuse in the second instruction. There is already a RAW
      dependency for \lstinline|R5|, so when register \lstinline|R4| is chosen, a redundant WAR dependency is added. This WAR
      dependency is free because of the existing RAW dependency for \lstinline|R5|, so \lstinline|R4| is a better choice than
      \lstinline|R2|. When reusing registers, the dag-driven register allocator prefers those registers that only introduce redundant dependencies.
\begin{lstlisting}[xleftmargin=3.5mm]
Add R5, R4, R1
Sub ??, R5, #4
\end{lstlisting}
\item Second, it tries to balance the growth of the dag. When a register is reused and the register allocator cannot make use of a
      free WAR dependency, it must add a dependency that connects two independent paths, which increases the height of the dag. The
      important part of balancing the dag is therefore to choose the path to which the current instruction's path is connected such that
      the height of the dag does not increase too much. To achieve this, the \textit{earliest issue time (EIT)} and the
      \textit{earliest finish time (EFT)} are computed for each instruction (a sketch of this computation follows the list). The register
      allocator then tries to find two paths such that one has a high EIT and the other has a large EFT.
\end{itemize}
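The EIT and EFT values can be computed with a single forward pass over the dag. The following sketch assumes the dag is given as a
successor map and that each node's latency is known; it is an illustration, not the paper's algorithm.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
from graphlib import TopologicalSorter

def earliest_times(dag, latency):
    # `dag` maps each node to its list of successors, `latency` maps each
    # node to its execution latency (both assumptions for this sketch).
    preds = {n: [] for n in dag}
    for n, succs in dag.items():
        for s in succs:
            preds.setdefault(s, []).append(n)
    eit, eft = {}, {}
    # Process nodes so that all predecessors are handled first.
    for n in TopologicalSorter(preds).static_order():
        eit[n] = max((eft[p] for p in preds[n]), default=0)   # earliest issue time
        eft[n] = eit[n] + latency.get(n, 1)                   # earliest finish time
    return eit, eft
\end{lstlisting}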
\subsection{Register Allocation with Scheduling Estimate (RASE)}
The RASE approach is an integrated scheduling and register allocation method. The idea behind RASE is to perform a pre-scheduling in
order to calculate schedule cost estimates that enable the register allocator to make a decision about how many registers it is going to
use. Essentially RASE splits the available registers into two parts, one part for the register allocator, and one part for the
instruction scheduler. The RASE algorithm has three phases:
\begin{enumerate}
\item Pre-scheduling
\item Global register allocation
\item Instruction scheduling with local register allocation
\end{enumerate}
The \textbf{first phase} is the pre-scheduling, which calculates the schedule cost estimate. During the pre-scheduling, the order of the
instructions is not changed.
The \textbf{second phase} performs a \textit{global} register allocation using the schedule cost estimate, but leaves enough free
registers for the last phase. The schedule cost estimate allows the register allocator to estimate how much the cost of local register
allocation increases if it uses more registers for \textit{global} symbolic registers. In the second phase the register
allocator is only responsible for allocating global symbolic registers to physical registers; the allocation of \textit{local} symbolic
registers is deferred until the third phase.
The \textbf{third phase} performs the instruction scheduling and does the \textit{local} register allocation. This register allocation
must be within the limits for \textit{local} registers that have been determined in the second phase.
\textbf{How does the schedule cost estimate function work?} The schedule cost estimate is a function of \textit{regressive} form
(as shown in Figure~\ref{schedule-cost-estimate-function}, it has the shape $c+d/x^2$). It takes the number of local registers as input
and returns the estimated machine cycles:
\begin{figure}[h!]
\begin{center}
\begin{tikzpicture}
%\draw[->] (-0.5,0) -- (4.2,0) node[right] {$x$};
%\draw[->] (0,-0.5) -- (0,4.2) node[above] {$y$};
\begin{axis}[xmin=3,xmax=22,ymin=200,ymax=600,samples=55,y=0.017cm,
xlabel={Number of local registers},
ylabel={Estimated machine cycles}]
\addplot[blue, smooth, ultra thick, domain=3:22] (x,{220+3200/(x*x)});
\end{axis}
%\draw[scale=0.5,domain=1:8,smooth,variable=\x,blue] plot ({\x},{1 + (7 / \x^2)});
\draw[<-,>=stealth,line width=2pt] (1,2.2) -- (2,3.2) {};
\node (formula) at (4.1,3.4) {$schedcost_b(x)=c+d/{x^2}$};
\end{tikzpicture}
\end{center}
\caption{Schedule cost estimate function}
\label{schedule-cost-estimate-function}
\end{figure}
The coefficients $c$ and $d$ are computed as part of the pre-scheduling. Essentially, this function allows the register allocator to
estimate by how much the required machine cycles increase, due to the loss of parallelism, if the number of available local registers is reduced.
As can be seen in Figure~\ref{schedule-cost-estimate-function}, if the register allocator leaves 15 registers to the third phase,
then the total execution cost for that basic block is not very high; however, if only 4 or 5 registers are left to the third phase,
then the cost increases quite significantly.
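As a small illustration, the register allocator can simply evaluate this function to see what a basic block would pay for taking local
registers away from the scheduler. The coefficients below are the illustrative values behind the curve in
Figure~\ref{schedule-cost-estimate-function}, not measured data.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
def sched_cost(x, c=220.0, d=3200.0):
    # Schedule cost estimate of the form c + d/x^2; the coefficients are
    # the illustrative values used for the curve in the figure above.
    return c + d / (x * x)

# Leaving 15 local registers to the scheduler costs little extra ...
print(round(sched_cost(15)))   # ~234 estimated cycles
# ... while squeezing it down to 5 registers is noticeably more expensive.
print(round(sched_cost(5)))    # ~348 estimated cycles
\end{lstlisting}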
\subsection{Which algorithm should be preferred?}
When comparing the three algorithms, one can say that IPS is the ``best''
algorithm, because it is relatively easy to implement and produces results that are comparable to the other two solutions. The IPS
algorithm can easily be integrated into existing register allocators such as Chaitin's or Briggs's. In contrast, the dag-driven algorithm and
RASE are both considerably more complicated to implement but do not produce significantly better results.
\section{Trace scheduling}
Trace scheduling attempts to provide another solution for the phase ordering problem. A trace is \textbf{a sequence of instructions that
can have exit and entry points but that must not have any loops}. The trace
scheduler orders traces by their frequency and schedules the most frequent trace first. Innermost loops usually contain
the traces with the highest frequency and are thus scheduled first. Since the instruction scheduler is also responsible for assigning
variables to registers, this scheduling approach effectively gives priority to those variables that are used most frequently.
The advantage of this approach is that when the most frequent trace is scheduled, all registers are still available, giving the scheduler
a completely free choice of registers for the most frequently used values.
Since traces are scheduled independently of each other, care must be taken where two traces split or merge. This can happen either at
split nodes, where one trace leaves another, or at join nodes, where one trace joins another. Because the traces are scheduled and
register-allocated individually, there needs to be a mechanism to communicate register allocation decisions
between traces. This is done by adding special nodes called \textit{Value Location Mappings} (VLMs).
\subsection{Value Location Mapping (VLM)}
A VLM is placed at the split or join nodes between two traces and contains information about the variables that have already been allocated.
When a trace is scheduled that contains a variable which has already been allocated, the scheduler can take this decision into account. It
tries to ensure that individually scheduled traces reference their values at the same locations to avoid unnecessary data motions such as
register moves, spills and restores. If a VLM appears at the beginning of a trace, then the scheduler must make sure to read values that
appear in the trace from the locations specified in the VLM. On the other hand, if a VLM appears at the end of a trace, then the
scheduler must make sure to \textit{satisfy} the references in the VLM, that is, it must store values that appear in the trace at the
locations specified by the VLM. Sometimes it may not be possible to satisfy the constraints in a VLM. This can happen if there is a VLM
both at the top and bottom of a trace and both VLMs have conflicting locations for the same variable. If it is not possible to satisfy
the location constraints in a VLM, then the scheduler must insert \textbf{compensation code} to move or copy values to the correct
locations.
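A minimal sketch of how a scheduler might satisfy a VLM at the bottom of a trace is shown below; the data structures (a mapping from
values to required locations, a table of current locations, and an \lstinline|emit_move| callback) are assumptions made for the example.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
def satisfy_vlm(vlm, current_location, emit_move):
    # For every value constrained by the VLM at the bottom of a trace,
    # check whether it already lives in the required location; if not,
    # emit compensation code (a move, spill or restore) to put it there.
    for value, required in vlm.items():
        actual = current_location[value]
        if actual != required:
            emit_move(value, actual, required)   # compensation code
\end{lstlisting}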
\subsection{Delayed binding}
Delayed binding is used because some values have live ranges that pass through a trace although the value is not actually used inside the
trace. \textit{``A delayed binding is a sort of pseudo-location that may be assigned to an unreferenced value by the instruction
scheduler.''} Delaying the variable binding until there is an actual reference to it helps to preserve registers for more important
values. When a delayed binding is bound to an actual location, it is checked whether there exists a register that is free throughout
the whole live range of the value; otherwise the value is bound to memory.
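A minimal sketch of this final binding decision follows; the helper names were chosen for the example and are not from the paper.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
def bind_delayed(value, live_range, registers, is_free_over, memory_slot):
    # Bind a delayed binding to a concrete location: prefer a register
    # that is unused over the value's entire live range, otherwise fall
    # back to a memory location.
    for reg in registers:
        if is_free_over(reg, live_range):
            return reg
    return memory_slot(value)
\end{lstlisting}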
\textbf{Paper Evaluation:} The evaluation of the trace scheduling paper is not very useful, because the results are only compared
against themselves and not against any other approach.
\section{Software pipelining}
Software pipelining is a form of instruction scheduling with the goal of scheduling the instructions of a loop in such a way
that multiple loop iterations can be active at the same time.
\subsection{Modulo Variable Expansion}
One problem in software pipelining arises when a variable is defined and then used, for example, two cycles later. If
this variable remains in the same register in every loop iteration, then successive iterations must start at least two cycles apart.
However, if copies of the variable are kept in different registers for consecutive iterations, then it is possible to start a new
iteration in each cycle. This concept of using multiple registers for the same variable is called \textbf{modulo variable expansion}.
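The bookkeeping behind this can be sketched as follows: if a value stays live for \lstinline|lifetime| cycles and a new iteration starts
every \lstinline|ii| cycles, roughly $\lceil \mathit{lifetime}/\mathit{ii}\rceil$ register copies are needed, and iteration $i$ uses copy
$i \bmod k$. This is a simplified illustration, not the exact scheme from the paper.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
from math import ceil

def modulo_expansion(lifetime, ii, iterations):
    # k register copies keep `lifetime` cycles of liveness from colliding
    # when a new iteration starts every `ii` cycles; iteration i then
    # uses copy i mod k.
    k = ceil(lifetime / ii)
    return {i: f"r{i % k}" for i in range(iterations)}

# A value used two cycles after its definition, one iteration per cycle:
print(modulo_expansion(lifetime=2, ii=1, iterations=4))
# -> {0: 'r0', 1: 'r1', 2: 'r0', 3: 'r1'}
\end{lstlisting}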
%TODO: Which problems occur du? What is it? Why is it needed?
\subsection{Resource-constrained software pipelining}
This algorithm provides a solution to the software pipelining problem by combining instructions into sets of available instructions which
are then scheduled. The algorithm is very elegant but has the grave problem that it is only presented graphically; the actual
\textbf{heuristic} for choosing which instructions can be combined into the sets and how to schedule them \textbf{is missing}, and thus the
algorithm cannot be implemented.
\subsection{Iterative modulo scheduling}
Different scheduling strategies exist in software pipelining; iterative modulo scheduling is a form of the ``schedule-then-move''
strategy. The goal of iterative modulo scheduling is to find a schedule for the instructions of a loop body such that the loop
can be repeated at regular intervals without causing any conflicts between the instructions. Two types of conflict can
occur, resource conflicts and dependency conflicts:
\begin{itemize}
\item Resource conflict: the same processor resource, such as a bus or a pipeline stage, is used by two instructions at the same time.
\item Dependency conflict: can be an inter-iteration dependency, if there is a dependency between instructions in two different
      iterations of the loop, or an intra-iteration dependency, if there is a dependency between two instructions in the same loop iteration.
\end{itemize}
Before the actual iterative modulo scheduling is performed, several other optimizations and transformations are applied; one of them is
IF-conversion. \textbf{IF-conversion} removes all branches except the loop-closing branch and converts the loop body into a single basic block. The different branches of the original code are instead expressed by data dependencies involving predicates.
Iterative modulo scheduling introduces the notion of an \textbf{initiation interval (II)}, which is the number of cycles between the
starts of two consecutive loop iterations. An II of four means that a new loop iteration begins every fourth cycle. A \textbf{minimum
initiation interval (MII)} is calculated and used as a lower bound for the II. To compute the actual II, a
candidate II is initially set to the MII and then incremented step by step until a suitable II is found (a sketch of this search loop
follows the list below). The MII is the maximum of the resource-constrained MII and the recurrence-constrained MII:
\begin{itemize}
\item The \textbf{resource-constrained MII (ResMII)} is used to avoid two instructions occupying the same processor resource at the same
      time. For example, if an add instruction needs the result bus in its fourth cycle and a multiply instruction needs the result bus
      in its sixth cycle, then an add cannot be scheduled two cycles after a multiply. The ResMII can be computed by performing a
      bin-packing of the reservation tables of all the operations.
\item The \textbf{recurrence-constrained MII (RecMII)} accounts for data dependencies between operations in different loop iterations:
      the II must be large enough that every such inter-iteration dependence (recurrence) is respected.
% TODO: How does this work?
\end{itemize}
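Putting these pieces together, the search for the actual II can be sketched as follows; the \lstinline|try_to_schedule| helper and the
upper bound on the candidate II are assumptions made for the example.
\begin{lstlisting}[language=Python,xleftmargin=3.5mm]
def find_ii(loop_ops, res_mii, rec_mii, budget, try_to_schedule, max_ii=64):
    # Iterative modulo scheduling driver: try successive candidate IIs,
    # starting at the MII, until a conflict-free schedule is found.
    mii = max(res_mii, rec_mii)          # minimum initiation interval
    for ii in range(mii, max_ii + 1):
        schedule = try_to_schedule(loop_ops, ii, budget)
        if schedule is not None:         # no resource or dependence conflicts
            return ii, schedule
    return None                          # give up beyond max_ii (assumption)
\end{lstlisting}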
Iterative modulo scheduling starts with the initially computed II and tries to find a valid schedule for this II within a given budget. If
no schedule can be found, the II is increased by one and the procedure is repeated until a schedule is found. There are some differences
between traditional acyclic list scheduling and iterative modulo scheduling:
\begin{itemize}
\item Instructions can also be unscheduled and rescheduled; thus, operation scheduling instead of instruction scheduling is performed.