This chapter gives an introduction to program analysis, parallel programs, and relaxed memory models.
We will also introduce the C++ programming language, as it is the language we primarily target, and the DIVINE model checker, in which all the contributions of this work are implemented.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Program Analysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We will not be concerned with simple program analysis techniques like type-checkers and linters.
Instead, we will focus on techniques based on formal verification, that is, techniques which can provide some formally described guarantees about the program if they succeed.
These techniques usually take a program and a specification the program should adhere to, and check whether the program satisfies it.
We will also focus mostly on techniques which can be directly applied to programs written in some mainstream programming language (i.e., techniques which do not require special verification-oriented languages as their inputs).
Program analysis techniques can be broadly divided into two different areas, \emph{automatic techniques} that, provided a program and its specification, should produce a result automatically without further assistance, and \emph{human-assisted techniques} which require a substantial human effort during the analysis.
Examples of the first area are symbolic execution and various model checking techniques, while theorem provers are an example of the latter kind.
As we focus on techniques that should be usable by programmers without a substantial background in formal logic, we will focus on the automatic tools.
\subsection{State Space}
To be able to define program properties that should be checked and to describe various analysis methods, we first need to define the \emph{state space} of a program.
\begin{definition}[State Space]
The \emph{state space} of a program is a directed multigraph with (optionally) labelled edges that describes all the ways in which the program can be executed.
The vertices of the state space multigraph are called \emph{states} (of the program).
Each state represents a certain point in the execution of the program -- it can be described by a snapshot of the program (its memory, program counters and stacks of all its threads, …).
States $v_1, v_2$ are connected by an edge in the state space if $v_2$ can be reached from $v_1$ by a single step of the program.
The edge can be optionally labelled, for example by the statements executed in each step, or by selected actions specified by the program or verification tool (such as \emph{error} label to indicate an error occurs on this edge or an \emph{accepting} label that can be used in the automata-based approach to LTL model checking).
The state space contains all the states of the program reachable from an initial state (an initial configuration of the program).
\end{definition}
The input for the analysis is usually not the state space graph.
Typically, the state space is specified by its implicit representation -- the program code, or possibly a function that describes the initial states and how to get from one state to its successors.
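As an illustration of how an analysis can work directly with such an implicit representation, the following sketch (our own example, not code from any particular tool) enumerates all reachable states given only an initial state and a successor function; \texttt{State} is a placeholder type.
\begin{cppcode}
#include <functional>
#include <queue>
#include <set>
#include <vector>

// Explore the state space given implicitly by an initial state and a
// successor function; terminates only if the state space is finite.
// State must be ordered so that it can be stored in a std::set.
template< typename State >
std::set< State > reachable_states(
        const State &initial,
        std::function< std::vector< State >( const State & ) > successors )
{
    std::set< State > seen{ initial };
    std::queue< State > open;
    open.push( initial );
    while ( !open.empty() ) {
        State s = open.front();
        open.pop();
        for ( const State &succ : successors( s ) )
            if ( seen.insert( succ ).second ) // visit each state only once
                open.push( succ );
    }
    return seen;
}
\end{cppcode}
In practice, tools differ mainly in how states are represented and stored and in how successors are generated, but the overall shape of the exploration is the same.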
In practice, the state space of a program can be very large or infinite, and it
is useful to be able to consider only a representative part of the state space.
\begin{definition}[Reduced State Space]
A \emph{reduced state space} of a program is derived from the state space as follows: its states form a subset of the states of the original state space, and its edges connect states between which there is a directed path in the original state space such that none of the internal states of that path is contained in the reduced state space.\mnote{
For example, consider the following state space:\\
\begin{tikzpicture}[>=latex,>=stealth',auto,node distance=1em,semithick,initial text=,
state/.style={inner sep=3pt, draw, circle}]
\node[state, initial, initial above] (i) {};
\node[state, below left = of i] (a) {};
\node[state, below right = of i] (b) {};
\node[state, below left = of a] (aa) {};
\node[state, below right = of a] (ab) {};
\node[state, below right = of b] (bb) {};
\node[state, below right = of aa] (aab) {};
\node[state, below right = of ab] (abb) {};
\node[state, below right = of aab] (aabb) {};
\path[->, shorten >=1pt]
(i) edge (a)
(i) edge (b)
(a) edge (aa)
(a) edge (ab)
(b) edge (ab)
(b) edge (bb)
(aa) edge (aab)
(ab) edge (aab)
(ab) edge (abb)
(bb) edge (abb)
(aab) edge (aabb)
(abb) edge (aabb)
;
\end{tikzpicture}
A corresponding reduced state space might be:\\
\begin{tikzpicture}[>=latex,>=stealth',auto,node distance=1em,semithick,initial text=,
state/.style={inner sep=3pt, draw, circle},
rem/.style={state, thin, dashed, draw=gray},
]
\node[state, initial, initial above] (i) {};
\node[state, below left = of i] (a) {};
\node[rem, below right = of i] (b) {};
\node[state, below left = of a] (aa) {};
\node[rem, below right = of a] (ab) {};
\node[rem, below right = of b] (bb) {};
\node[state, below right = of aa] (aab) {};
\node[rem, below right = of ab] (abb) {};
\node[state, below right = of aab] (aabb) {};
\path[->, shorten >=1pt]
(i) edge (a)
(i) edge[rem] (b)
(a) edge (aa)
(a) edge[rem] (ab)
(b) edge[rem] (ab)
(b) edge[rem] (bb)
(aa) edge (aab)
(ab) edge[rem] (aab)
(ab) edge[rem] (abb)
(bb) edge[rem] (abb)
(aab) edge (aabb)
(abb) edge[rem] (aabb)
;
\end{tikzpicture}
Another possibility is:\\
\begin{tikzpicture}[>=latex,>=stealth',auto,node distance=1em,semithick,initial text=,
state/.style={inner sep=3pt, draw, circle},
rem/.style={state, thin, dashed, draw=gray},
]
\node[state, initial, initial above] (i) {};
\node[rem, below left = of i] (a) {};
\node[rem, below right = of i] (b) {};
\node[state, below left = of a] (aa) {};
\node[rem, below right = of a] (ab) {};
\node[state, below right = of b] (bb) {};
\node[rem, below right = of aa] (aab) {};
\node[rem, below right = of ab] (abb) {};
\node[state, below right = of aab] (aabb) {};
\path[->, shorten >=1pt]
(i) edge[rem] (a)
(i) edge[rem] (b)
(a) edge[rem] (aa)
(a) edge[rem] (ab)
(b) edge[rem] (ab)
(b) edge[rem] (bb)
(aa) edge[rem] (aab)
(ab) edge[rem] (aab)
(ab) edge[rem] (abb)
(bb) edge[rem] (abb)
(aab) edge[rem] (aabb)
(abb) edge[rem] (aabb)
;
\path[->, shorten >=1pt]
(i) edge (aa)
(i) edge (bb)
(aa) edge (aabb)
(bb) edge (aabb)
;
\end{tikzpicture}
}
\end{definition}
Several techniques can be used to construct a reduced state space in such a way that a given property (or a class of properties) holds in the reduced state space if and only if it holds in the original state space.
Examples include partial order reduction~\mcite{Peled1993}, dynamic partial order reduction~\mcite{Flanagan2005dpor}, and $\tau+$ reduction~\mcite{RBB13}.
On top of the reductions, it is also possible to build an \emph{abstraction} of a state space, i.e., a state space that preserves some of the properties of the original but abstracts away some of its complexity.
These abstractions are usually built from the implicit representation of the program.
Abstractions are often used to reduce large or infinite state spaces to sizes that can be managed.
Usually, abstractions are designed such that results derived from the abstract state space can be used to obtain some information about the original state space.
\begin{definition}[Over- and Under-Approximations in Abstraction]
Suppose we have an abstracted state space $\widehat{S}$ that corresponds to a (concrete) state space $S$.
Suppose further that we are interested in the reachability of states that satisfy some property $P$ that can be evaluated both in $S$ and $\widehat{S}$.
\begin{itemize}
\item We say that $\widehat{S}$ is an \emph{over-approximation} of $S$ if the fact that no state that satisfies $P$ is reachable in $\widehat{S}$ implies that no such state is reachable in $S$ (i.e., $\widehat{S}$ can contain more behaviour that leads to such a state).
\item We say that $\widehat{S}$ is an \emph{under-approximation} of $S$ if the fact that a state that satisfies $P$ is reachable in $\widehat{S}$ implies that there is a state that satisfies $P$ reachable in $S$ (i.e., $\widehat{S}$ can contain less behaviour that leads to such a state).
\end{itemize}
\end{definition}
Abstractions are also connected with the notion of \emph{refinement}.
Often a program is over-approximated, then analysed, and if an error is found, this error is validated.
If the error is valid, then there is a real error in the original program.
However, if the error is not valid, the error trace is used to \emph{refine} the abstraction, i.e., make it more precise to rule out this spurious error (and preferably also similar spurious errors).
This technique is often called \emph{Counterexample-Guided Abstraction Refinement} (\emph{CEGAR}) \mcite{Clarke2000, Kurshan1995}.
An example of an abstraction technique is \emph{predicate abstraction} \mcite{Hu1998}.
Instead of tracking the original values of variables, a program abstracted with predicate abstraction tracks only the validity of predicates over these values.
For example, it might track predicates such as $(x > 0)$ or $(y \le x)$.
Refinement is usually done by adding new predicates.
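As a concrete (purely illustrative) example, consider abstracting the following loop with the single predicate $p \equiv (x > 0)$; the helpers \texttt{read\_input} and \texttt{nondet} are hypothetical stand-ins for unknown input and for the nondeterminism introduced by the abstraction.
\begin{cppcode}
#include <cassert>
#include <cstdlib>

int  read_input() { return std::rand() % 100; }  // unknown input
bool nondet()     { return std::rand() % 2; }    // nondeterministic choice

// Concrete program: the analysis would have to track the exact value of x.
void concrete() {
    int x = read_input();
    while ( x > 0 )
        x = x - 1;
    assert( x <= 0 );
}

// Boolean program obtained by predicate abstraction with p == (x > 0);
// updates whose effect on p cannot be determined become nondeterministic,
// so the abstraction over-approximates the concrete behaviour.
void abstracted() {
    bool p = nondet();      // x = read_input(): sign of x unknown
    while ( p )
        p = nondet();       // x = x - 1: x may or may not stay positive
    assert( !p );           // corresponds to checking x <= 0
}
\end{cppcode}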
\medskip
Finally, we will need the following state-space-related definitions.
\begin{definition}[Finite State Space]
We say that a program has a \emph{finite state space} if the number of vertices and edges in the state space multigraph is finite.
Otherwise, we say the program has an \emph{infinite state space}.
\end{definition}
\begin{definition}[Run]
A \emph{run} in the state space is a (possibly infinite) path in the state space (or reduced state space) that starts in the initial state.
That is, a run is a sequence of states $\sigma = s_0, s_1, \dots$ such that $s_0$ is the initial state and for each consecutive pair of states $s_i, s_{i+1}$ there is an edge from $s_i$ to $s_{i + 1}$ in the corresponding (reduced) state space.
States can be repeated in a run and a run can be infinite.
\end{definition}
\subsection{Program Properties}
To be able to find errors in programs, an analysis tool has to have
some specification of program correctness -- a \emph{property} that should be
checked.
In this section, we will introduce several categories of program properties
together with examples of particular properties.
\paragraph{Safety Properties}
%
are properties that can be checked locally -- for each state of the program, we
can determine if the state satisfies or violates the given property.
Therefore, to conclude that a program is error-free under a given safety property, it suffices to show that no state violates the property.
Examples of safety properties include \emph{memory safety} (every
time we access a particular part of memory, this memory is allocated and
accessible to the program), \emph{assertion safety} (a program
does not violate any of the assertions in the source code), and
\emph{control-flow definedness} (every time a conditional jump is
performed, the values the jump is based on are well defined).
Some program properties can be checked locally only if sufficient information
is kept in the program state.
For example, checking \emph{the absence of memory leaks} requires that it is
possible to enumerate all allocated objects and all objects to which there are
usable pointers.
Similarly, mutex-related deadlocks can be detected by safety analysis if the program keeps track of the graph of waiting for mutexes.
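The following hypothetical fragment illustrates the first two examples: each of the two errors, if reachable, is witnessed by a single program state (a call of \texttt{make\_buffer} with a non-positive \texttt{n}, or a call of \texttt{set} with an out-of-bounds index).
\begin{cppcode}
#include <cassert>

int *make_buffer( int n ) {
    assert( n > 0 );     // assertion safety: violated if n <= 0
    return new int[ n ];
}

void set( int *buf, int n, int i ) {
    buf[ i ] = 42;       // memory safety: violated if i < 0 or i >= n
}
\end{cppcode}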
\paragraph{Temporal properties}
%
Not all properties can be directly described as safety properties.
For example, we can have a property stating ``every time a button is pressed, the elevator will eventually drive to the floor the button was pressed on''.
Such a property cannot be checked directly in one state of the program, but it is a property of \emph{runs} of the program.
Various temporal logics can be used to describe such properties, for example
LTL, CTL, CTL* \mcite{Baier2008} or $\mu$-calculus \mcite{Kozen1982}.
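For instance, the elevator property above can be expressed in LTL (using atomic propositions $\mathit{pressed}_i$ and $\mathit{at\_floor}_i$ that we introduce for floor $i$) as
\[ \mathbf{G}\, \big( \mathit{pressed}_i \rightarrow \mathbf{F}\, \mathit{at\_floor}_i \big), \]
where $\mathbf{G}$ (globally) requires the implication to hold in every state of a run and $\mathbf{F}$ (eventually) requires $\mathit{at\_floor}_i$ to hold in some later state of the run.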
\paragraph{Termination Properties}
%
A distinguished kind of temporal properties are termination properties, which allow us to specify that a program must (or must not) terminate, or that some of its parts must (or must not) terminate.
With tools that check for termination, we often distinguish \emph{termination analysis}, which is an analysis that aims to prove termination but is often unable to conclude that a program does not terminate, and \emph{nontermination analysis}, which aims to prove that the program does not terminate.
This distinction is useful as the heuristics used for proving termination and nontermination are often different, and therefore a single analyser might not possess both capabilities.
\subsection{Hardware and Platform Related Considerations}
When a program is executed on given hardware, its behaviour might
depend on the concrete features of that hardware.
Therefore, for a program analysis to be usable, one must know which
hardware model(s) the analysis tool assumes and how they relate to the
real hardware on which the program will be executed.
In practice, it is often hard to prove that a program will be correct on
\emph{any} hardware on which it will be possible to compile it.
For example, the size (and therefore precision) of the standard C/C++ type
\cpp{int} might depend on the hardware the program is using: on an 8-bit or
16-bit embedded microprocessor, \cpp{int} will likely be a 16-bit type with a
maximum value of $2^{15} - 1$, while on a common 64-bit (or 32-bit) computer it
will be a 32-bit number with the maximum value of $2^{31}-1$.
Therefore, code that causes a bug due to integer overflow on a 16-bit
embedded microprocessor might be correct on a 64-bit machine.
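For illustration, the following (hypothetical) function is correct when \cpp{int} is 32 bits wide, but exhibits signed overflow -- which is undefined behaviour in C and C++ -- already for small arguments when \cpp{int} is 16 bits wide; fixed-width types avoid the problem.
\begin{cppcode}
#include <cstdint>

int scale( int centimetres ) {
    // With a 16-bit int (maximum 32767), this overflows already
    // for centimetres > 32; with a 32-bit int it is fine here.
    return centimetres * 1000;
}

std::int32_t scale_portable( std::int32_t centimetres ) {
    return centimetres * 1000;   // width fixed explicitly
}
\end{cppcode}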
Another example concerns the \emph{alignment} of values in memory.
Often it is
possible to address single bytes in memory, but certain data types (larger
than a byte) are only allowed to start on addresses divisible by a given
\emph{alignment}.
For example, the alignment of a 32-bit \cpp{int} type might be 4 bytes (32
bits).
Depending on the hardware platform, reading and writing unaligned values might
work (e.g., on x86 and x86-64), might trigger an error (e.g., on some ARM
processors) or the address might be silently rounded to the nearest
aligned address (e.g., on some embedded ARM processors).
Therefore, the behaviour of the same program with an unaligned read can differ
drastically depending on the platform on which it is executed.
Furthermore, some features of the program are affected not only by the
hardware, but also by the operating system (or more generally the platform) for
which it is compiled.
For example, the calling conventions of C programs compiled on Windows and
Linux differ even on the same hardware.
Similarly, sizes of data types might depend on the platform -- on Windows, the C++ \cpp{long} type is 32 bits wide even on 64-bit systems (and therefore cannot be used
to hold pointer values), while it is 64 bits wide on 64-bit Linux systems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parallelism \& Threading Model} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Parallelism is an essential part of programming high-performance software that
can fully take advantage of the current hardware.
However, parallelism comes with additional problems not present in the development of sequential software.
Namely, due to the need for synchronisation, it is significantly harder to
create correct parallel software compared to the development of sequential
software.
Furthermore, it is also hard to create parallel software that scales well with
the number of processors (or processor cores) available.
Improving scalability usually makes the problem of correctness even harder --
it often requires more fine-grained (and therefore error-prone)
synchronisation.
For example, fine-grained locking or lock-free programming can lead to
high-performance code, but checking that the code uses synchronisation properly
is significantly harder than for code that uses a few locks with well defined
regions of shared memory they guard.
Furthermore, relaxed memory behaviour comes into play once shared variables are
not only accessed in critical sections (i.e., protected by mutexes).
As processor manufacturers strive to increase the speed of processors, they have
introduced various optimisations into the memory infrastructure to avoid
waiting for the relatively slow main memory.
These optimisations can be (and in most processors are) visible to a parallel
program and result in relaxed memory behaviour.
For example, on x86 processors (which are used in most laptop, desktop and
server computers) a memory store can be delayed and appear, from the point of
view of the other threads, later than a subsequent load.
Some processors (for example POWER and higher-performance ARM processors) allow
even more reordering, for example, reordering of independent loads.
In this section, we describe the basic model of parallelism we assume for our
programs and follow some of its implications.
We also outline the basics of relaxed memory behaviour and memory models
which describe it.
\subsection{Basic Threading Model}
We assume that a program consists of one or more threads that interact using
shared memory.\mnote{Other communication methods can be emulated using shared
memory. Furthermore, many programming languages, including C++, have no native
support for other communication schemes.}
We also assume that the execution of a thread can be interrupted at any point by a
\emph{thread scheduler} (of the operating system on which the program runs).
The execution will eventually be resumed (unless the program exits in the meantime or the thread remains \emph{blocked} forever).
A thread can be \emph{blocked} if it waits for a resource provided by the
operating system, for example, availability of a lock or input from an input
device.
If a thread cannot be resumed because it is blocked and never ceases to be
blocked, we say there is a \emph{deadlock}.
\begin{definition}[Deadlock, Partial Deadlock, Global Deadlock]\label{def:deadlock}
A \emph{deadlock} happens if a scheduler never allows a thread to run
because it waits for a resource that never becomes available.%
\begin{marginnote}
For example, suppose a program with threads \texttt{t0} and \texttt{t1}:
\medskip
\begin{cppcode}
mutex mtx0, mtx1;
void t0() {
    unique_lock l0( mtx0 );
    unique_lock l1( mtx1 );
    /* ... */
}
void t1() {
    unique_lock l1( mtx1 );
    unique_lock l0( mtx0 );
    /* ... */
}
\end{cppcode}
\smallskip
If \texttt{t0} locks \texttt{mtx0} and \texttt{t1} locks \texttt{mtx1} then each of the threads will attempt to lock the mutex already owned by the other thread.
Therefore, the threads are in deadlock -- neither of them can proceed and both are blocked.
\end{marginnote}
Sometimes, when we need to distinguish between all the threads being
blocked and only some of them being blocked, we will use the notion of
\emph{partial deadlock} to signify a situation in a program where some
threads are blocked (in deadlock), but others can still proceed and
\emph{global deadlock} for a deadlock which blocks all threads.
Usually, we will just use \emph{deadlock} to refer to \emph{partial deadlock}.
\end{definition}
It is important to note that our definition of deadlock does not include
\emph{busy waiting} -- i.e., a situation in which a thread is executing a loop
that tests that some event occurred.
Busy waiting can be used for example to implement exclusive sections without
operating system support.
\begin{definition}[Livelock]\label{def:livelock}
\emph{Livelock} is a situation in which a thread is allowed to run (by
the scheduler), but does not proceed in any meaningful way because it
executes a loop that is waiting for an event that never happens.%
\begin{marginnote}
Suppose a program with threads \texttt{t0} and \texttt{t1}:
\medskip
\begin{cppcode}
atomic<bool> flag;
void t0() {
while (!flag) {
/* wait */
}
/* ... */
}
void t1() {
flag = true;
/* ... */
flag = false;
/* ... */
}
\end{cppcode}
\smallskip
If \texttt{t0} starts executing the \texttt{while} loop only after \texttt{t1} sets \texttt{flag} to \texttt{false} the loop will never end, and \texttt{t0} is in livelock -- it can execute, but does not proceed (provided \texttt{flag} is not set again).
\end{marginnote}
\end{definition}
\paragraph{Interleaving of Threads}
In practice program threads can run concurrently, i.e., multiple cores of the
processor can execute different threads at the same instant in time.
Furthermore, the scheduler can map threads to processor cores arbitrarily, and
it can even change the core that executes a given thread.
It is not practical (and not necessary) to simulate this behaviour when
performing an analysis of a parallel program.
Instead, we consider that threads are interleaved (\emph{interleaving
semantics}) or we consider execution based on some relaxed memory model.
\begin{definition}[Interleaving Semantics of Threads]
With the \emph{interleaving semantics}, we assume that all possible
executions of a parallel program can be obtained by interleaving.
That is, at any point in the execution of the program, we consider which
threads are allowed to run and explore all possible selections.
After a thread performs an action, we again consider all possible actions
of all threads and so on.
\end{definition}
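Note that the number of interleavings grows quickly: if two threads perform $m$ and $n$ atomic actions respectively, there are already $\binom{m+n}{m}$ ways to interleave them (for instance, $\binom{4}{2} = 6$ for two actions per thread), which is one of the reasons why state space reductions are important for the analysis of parallel programs.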
\subsection{Relaxed Memory Models}
A \emph{memory model} describes the behaviour of memory operations on a given platform.
A \emph{relaxed memory model} describes such behaviour on a platform with relaxed memory.
When considering relaxed memory models, it is also useful to consider a memory
model that is not relaxed and corresponds to the interleaving semantics of
threads.
This memory model is \emph{sequential consistency}.
\begin{definition}[Sequential Consistency]
\emph{Sequential consistency} is a memory model corresponding to the
interleaving semantics of threads.
Under sequential consistency, all effects of memory manipulating
instructions are visible to all threads immediately, and no instruction
reordering is observable.
\end{definition}
The operational semantics of sequential consistency is given by a machine that performs no instruction reordering and in which all memory operations are atomic and access memory directly, without any caches.
Sequential consistency is the strongest memory model (it is not relaxed at all).
Therefore, any behaviour observable under sequential consistency will also be possible under a relaxed memory model, but a relaxed memory model can exhibit additional behaviour.
In practice, sequential consistency does not faithfully describe modern processors,
which exhibit relaxed memory behaviour.
The relaxed behaviour of processors arises from optimisations in cache
consistency protocols and observable effects of instruction reordering and
speculation.
The effect of this behaviour is that memory-manipulating instructions can appear
to be executed in a different order than the order in which they appear in a
thread's code, and their effect can in some cases even appear to be in
a different order on different threads.
For efficiency reasons, most modern processors (except for simple ones in
embedded microcontrollers and low-cost mobile devices) exhibit relaxed behaviour.
The extent of this relaxation depends on the processor architecture (e.g.,
x86, ARM, POWER) but also on the concrete processor model.
To make matters worse, the actual behaviour of the processor is often not
precisely described by the processor vendor~\mcite{x86tso}.
To abstract from the details of particular processor models, \emph{relaxed
memory models} are used to describe (often formally) the behaviour of a given
processor architecture.
Examples of relaxed memory models of modern processors are the memory model of
x86 and x86-64 CPUs described formally as \xtso \mcite{x86tso} and the multiple
variants of POWER \mcite{Sarkar2011,MadorHaim2012} and
ARM \mcite{Alglave2014,Flur2016,Pulte2017} memory models.
For the description of a relaxed memory model, it is sufficient to consider
operations that affect memory.
These operations include loads (reading of data from memory to a register
in the processor), stores (writing of data from a register to memory),
memory barriers (used to prevent reordering), and \emph{atomic compound
operations} (read-modify-write operations and compare-and-swap operation).
Compound non-atomic instructions exist on some architectures (e.g., the
\texttt{add} instruction in x86 can have one memory operand), but they
can be rewritten to an equivalent sequence of a load to a register, a register
modification, and a store, and therefore need not be considered.
\subsection{The \xtso Memory Model\protect\mnotemark}\label{chap:prelim:xtso}%
\mnotetext{\srcnote{This subsection contains text and figures originally from \cite{SB2018x86tso}.}}
The \xtso memory model is a formal description of the memory model used in x86
and x86-64 processors (manufactured by both Intel and AMD) \mcite{x86tso}.
It is one of the strongest relaxed memory models -- it is relaxed compared to
sequential consistency, but not nearly as relaxed as some of the other common
memory models such as memory models of ARM and POWER processors.
\xtso is very similar to the SPARC Total Store Order (TSO) memory
model~\mcite{SPARC94}.
It does not reorder stores with each other, and it also does not reorder loads
with other loads.
The only relaxation allowed by \xtso is that a store can appear to be executed
later than an independent load that succeeds it in a thread.
The memory model does not give any limit on how long a store can be delayed.
An example of non-intuitive execution of a simple program under \xtso can be
found in \autoref{fig:xtso}.
\begin{figure}[t] % fig:xtso
\begin{minipage}[c]{0.22\textwidth}
\begin{cppcode}
int x = 0;
int y = 0;
void t0() {
y = 1;
int a = x;
int c = y;
}
void t1() {
x = 1;
int b = y;
int d = x;
}
\end{cppcode}
\end{minipage}
%
\hfill
%
\begin{minipage}[c]{0.77\textwidth}
\begin{center}
\noindent
Is $a = 0 \land b = 0$ reachable?\\[2.5ex]
\medskip
\begin{tikzpicture}[ ->, >=stealth', shorten >=1pt, auto, node distance=3cm
, semithick
, scale=0.5
]
\draw [-] (-10,0) rectangle (-7,-5);
\draw [-] (-10,-1) -- (-7,-1)
(-10,-2) -- (-7,-2)
(-10,-3) -- (-7,-3)
(-10,-4) -- (-7,-4);
\draw [-] (-9,0) -- (-9,-5);
\node () [] at (-8.5,0.5) {shared memory};
\node () [anchor=west] at (-10,-2.5) {\texttt{\color{blue}x}};
\node () [anchor=west] at (-9,-2.5) {\texttt{\color{blue}0}};
\node () [anchor=west] at (-10,-3.5) {\texttt{\color{blue}y}};
\node () [anchor=west] at (-9,-3.5) {\texttt{\color{blue}0}};
\node () [anchor=center] at (-2.5,-3.5) {store buffer};
\draw [-] (-4.5,-4) rectangle (-0.5,-5);
\draw [-] (-2.5,-4) -- (-2.5,-5);
\node () [anchor=center] at (3.5,-3.5) {store buffer};
\draw [-] (1.5,-4) rectangle (5.5,-5);
\draw [-] (3.5,-4) -- (3.5,-5);
\node () [anchor=west] at (-4.5,-4.5) {\texttt{\color{red}y}};
\node () [anchor=west] at (-2.5,-4.5) {\texttt{\color{red}1}};
\node () [anchor=west] at (1.5,-4.5) {\texttt{\color{red}x}};
\node () [anchor=west] at (3.5,-4.5) {\texttt{\color{red}1}};
\node () [anchor = west, xshift = -1em] at (-4.5, 0.5) {thread 0};
\draw [->] (-4.5,0) -- (-4.5,-3);
\node () [anchor=west] at (-4, -0.5) {\texttt{\color{red}y = 1;}};
\node () [anchor=west] at (-4, -1.5) {\texttt{\color{blue}load x; \textrightarrow 0}};
\node () [anchor=west] at (-4, -2.5) {\texttt{\color{frombuf}load y; \textrightarrow 1}};
\node () [anchor = west, xshift = -1em] at (1.5, 0.5) {thread 1};
\draw [->] (1.5,0) -- (1.5,-3);
\node () [anchor=west] at (2, -0.5) {\texttt{\color{red}x = 1;}};
\node () [anchor=west] at (2, -1.5) {\texttt{\color{blue}load y; \textrightarrow 0}};
\node () [anchor=west] at (2, -2.5) {\texttt{\color{frombuf}load x; \textrightarrow 1}};
\end{tikzpicture}
\end{center}
\end{minipage}
\caption{
A demonstration of the \xtso memory model.
Thread 0 stores 1 to variable \texttt{y} and then loads variables \texttt{x} and \texttt{y}.
Thread 1 stores 1 to \texttt{x} and then loads \texttt{y} and \texttt{x}.
Intuitively, we would expect it to be impossible for $a = 0$ and $b = 0$ to both be true at the end of the execution, as there is no interleaving of thread actions which would produce such a result.
However, under \xtso, the stores are cached in the store buffers (marked \textcolor{red}{red}).
A load consults only shared memory and the store buffer of the given thread, which means it can load data from memory and ignore newer values from the other thread (\textcolor{blue}{blue}).
Therefore \texttt{a} and \texttt{b} will contain old values from memory.
On the other hand, \texttt{c} and \texttt{d} will contain local values from the store buffers (locally read values are marked \textcolor{frombuf}{green}).
The figure depicts the state of memory and buffers after the code of both
threads executed, but before the data was propagated from store buffers to
main memory.
}
\label{fig:xtso}
\end{figure}
The operational semantics of \xtso is described in \mcite{x86tso}.
The corresponding machine has multiple hardware threads (or cores), each with an
associated local store buffer, a shared memory subsystem, and a shared memory
lock.
Store buffers are first-in-first-out caches into which store entries are saved
before they are propagated to shared memory.
Load instructions first attempt to read from the store buffer of the given
thread, and only if they are not successful, they read from shared memory.
Store instructions push a new entry to the local store buffer.
Entries in the store buffer are not visible to threads other than the one
owning the store buffer.
Atomic instructions include various read-modify-write instructions, e.g., atomic
arithmetic operations (that take a memory address and a constant),\mnote{These
instructions have the \texttt{lock} prefix in the assembly, for example,
\texttt{lock xadd} for atomic addition.}
or the compare-and-swap instruction.\mnote{\texttt{lock cmpxchg}}
All atomic instructions use the shared memory lock so that only one such
instruction can be executed at a given time, regardless of the number of
hardware threads in the machine.
Furthermore, atomic instructions flush the store buffer of their thread before
they release the lock.
Therefore, the effects of atomic operations are immediately visible, i.e.,
atomics are sequentially consistent on \xtso.
On top of these instructions, \xtso has a full memory barrier (\texttt{mfence})
which flushes the store buffer of the thread that executed it.\mnote{There
are two more fence instructions in the x86 instruction set, but according
to~\cite{x86tso} they are not relevant to normal program execution.}
If a programmer (or a compiler) wishes to recover sequential consistency on
x86, they need to ensure memory stores are propagated to main memory before
subsequent loads execute.
This is most commonly done by inserting a memory fence after each store.
An alternative approach is to perform the store using the atomic exchange instruction
(\texttt{lock xchg}), as it can atomically swap a value between a register and a
memory location.
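In C++, the same effect is typically obtained by using sequentially consistent atomics; the following sketch (our own rewrite of the store-buffering pattern from \autoref{fig:xtso}) illustrates the difference.
\begin{cppcode}
#include <atomic>

std::atomic< int > x{ 0 }, y{ 0 };

int t0() {
    // With seq_cst (the default ordering), the compiler emits the required
    // synchronisation on x86 (typically an mfence after the store, or a
    // lock xchg store), so a == 0 && b == 0 is no longer reachable.
    y.store( 1, std::memory_order_seq_cst );
    return x.load( std::memory_order_seq_cst );    // a
}

int t1() {
    // With release stores and acquire loads only, plain mov instructions
    // would be emitted on x86 and the relaxed outcome (a == 0 && b == 0)
    // would remain possible:
    //   x.store( 1, std::memory_order_release );
    //   return y.load( std::memory_order_acquire );
    x.store( 1, std::memory_order_seq_cst );
    return y.load( std::memory_order_seq_cst );    // b
}
\end{cppcode}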
One of the specifics of x86 is that it can handle unaligned memory
operations.\mnote{Other architectures, for example, ARM require loaded
values to be aligned, usually so that the address is divisible by the value
size.}
While the \xtso paper does not give any specifics about handling unaligned and
mixed memory operations\mnote{%
E.g., writing a 64-bit value and then reading a 16-bit value from inside it.},
it seems from our experiments that such operations are not only fully supported, but they are also correctly synchronised if
atomic instructions are used.
This behaviour is in agreement with the aforementioned operational semantics of \xtso in
which all the atomic operations share a single global lock.
\subsection{Other Hardware Memory Models}
Older works on the analysis of programs under relaxed memory models usually
consider the SPARC TSO, PSO and RMO memory models~\mcite{SPARC94} or the memory
model of ALPHA~\mcite{mckenney2010} processors.
However, these memory models are no longer very relevant, with the exception of TSO, which is sometimes used interchangeably with the \xtso memory model, as
the two are very similar (they differ in the behaviour of atomic compound
operations, and some works use TSO to stand for the memory model of x86 processors).
Currently, apart from the \xtso memory model, mostly the memory models of
variants of POWER and ARM processors are relevant.
While these processors differ significantly in the area in which they are used
(POWER is used in high-performance servers while ARM is mostly used in mobile
devices), their memory models share some basic features.
Recently, there has also been interest in the memory model of the RISC-V architecture, which is similar to the memory models of POWER and ARM.
All of these types of processors exhibit more relaxed behaviour than \xtso,
for example, it is possible to reorder a load after a store or to reorder
stores or loads with one another (provided they are independent).
On POWER, it can also happen that a sequence of operations executed by a single thread is observed by different threads in different orders.
An example of such behaviour can be seen in \autoref{fig:prelim:iriw}.
\begin{figure}[tp]
\begin{cppcode}
int x = 0, y = 0;
\end{cppcode}
\begin{multicols}{4}
\begin{cppcode}
void wrt0()
{
x = 1;
}
\end{cppcode}
\columnbreak
\begin{cppcode}
void wrt1()
{
y = 1;
}
\end{cppcode}
\columnbreak
\begin{cppcode}
void read0()
{
int x0 = x;
int y0 = y;
}
\end{cppcode}
\columnbreak
\begin{cppcode}
void read1()
{
int y1 = y;
int x1 = x;
}
\end{cppcode}
\end{multicols}
\vspace{-4.5ex}
\center{\emph{Is it possible that $x_0 = 1 \land y_0 = 0 \land x_1 = 0 \land y_1 = 1$?}}
\caption{
An independent readers of independent writers example commonly used to
demonstrate some of the features of POWER memory models.
We assume each of the functions is executed in a separate thread.
Here, we are asking if it can happen that while the \texttt{read0}
observes the new value of $x$ and old value of $y$ the \texttt{read1}
will observe the old value of $x$ and new value of $y$.
Such an observation would imply that the two modifications are visible
to the two readers in different orders.
This behaviour can indeed be observed on POWER \cite{Sarkar2011}, but
not on ARMv8 \cite{Pulte2017}.
}\label{fig:prelim:iriw}
\end{figure}
There are several versions (or generations) of ARM and POWER processors, and
also multiple versions of memory models.
The memory model of ARMv7 is described in~\mcite{Alglave2014}.
The memory model of newer ARMv8 is described in~\mcite{Flur2016}, but it was
later revised and simplified in collaboration with ARM and formalised
in~\mcite{Pulte2017} and the ARMv8 architecture description.
Later, in~\mcite{Pulte2019}, an operational model of ARMv8 concurrency, equivalent
to the one presented in~\mcite{Pulte2017} and optimised for program analysis,
was presented.
Interestingly, \mcite{Pulte2019} also presents a memory model of the RISC-V
architecture, which is similar to the revised ARMv8 memory model.
The basis of the POWER memory model was described as an abstract machine
in~\mcite{Sarkar2011} and later extended in~\mcite{Sarkar2012} to cover atomic
compound operations.
A more detailed memory model of POWER is presented in~\mcite{Gray2015},
including, for example, the behaviour of mixed-size memory operations.
An axiomatic description of POWER is provided by~\mcite{MadorHaim2012} and
\mcite{Alglave2010_fences}.
In~\mcite{Flur2017}, the authors describe the behaviour of POWER and ARMv8 in
the presence of mixed-size memory operations.
\subsection{Memory Models of Programming Languages\protect\mnotemark}%
\mnotetext{\srcnote{The paragraphs describing memory models of C, C++, Java, and LLVM in this subsection are based on text originally from \cite{Still2018thesis-proposal}.}}
Not all programming languages have concurrency and a memory model defined in the
language.
Indeed, C and C++ before the 2011 revisions of their standards did
not define concurrency, and therefore the majority of existing concurrent C code relies on non-standard concurrency support.
In the absence of programming-language support, programmers wishing to use
concurrency have to rely on a combination of library and compiler support that
provides (platform-specific) concurrency primitives.
For example, on Linux and most POSIX-compatible operating systems, the POSIX
threads library (\texttt{pthreads}) defines ways to launch threads, wait for
them, and synchronise their execution using various (blocking) synchronisation
primitives such as mutexes and condition variables.
A compiler compatible with this library then guarantees that its optimisations
will not break this functionality, for example, that it will not reorder
operations around the calls to synchronisation functions.
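A minimal sketch of this library-based approach (our own example) follows: the programmer uses the POSIX threads API directly, and the compiler merely guarantees that the access to \texttt{counter} is not moved out of the critical section.
\begin{cppcode}
#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;

void *worker( void * ) {
    pthread_mutex_lock( &mtx );    // blocking lock provided by the library
    ++counter;                     // protected by the critical section
    pthread_mutex_unlock( &mtx );
    return nullptr;
}

int main() {
    pthread_t t0, t1;
    pthread_create( &t0, nullptr, worker, nullptr );
    pthread_create( &t1, nullptr, worker, nullptr );
    pthread_join( t0, nullptr );
    pthread_join( t1, nullptr );
}
\end{cppcode}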
The compiler also usually defines low-level synchronisation primitives which correspond to atomic compound operations.
For example, the GCC and clang compilers both define builtin functions such as \texttt{\_\_atomic\_compare\_exchange} and \texttt{\_\_atomic\_add\_fetch}.
These builtins allow semi-portable use of atomic operations in the sense that they are not specific to the concrete hardware platform.\mnote{As would be the case if inline assembly were used to invoke the atomic instruction directly.}
Still, they are bound to the given compiler or a group of compilers (such as GCC and clang).
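For instance, an atomic counter increment and a simple compare-and-swap can be written with these builtins as follows (a sketch; the \texttt{\_n} variant of compare-exchange takes the desired value directly instead of through a pointer):
\begin{cppcode}
long counter = 0;

void increment() {
    // Atomically add 1 to counter with sequentially consistent ordering.
    __atomic_add_fetch( &counter, 1, __ATOMIC_SEQ_CST );
}

bool claim( long *slot, long owner ) {
    long expected = 0;
    // Atomically replace *slot with owner, but only if it is still 0.
    return __atomic_compare_exchange_n( slot, &expected, owner,
                                        /* weak */ false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST );
}
\end{cppcode}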
The downside of the library and compiler approach to concurrency is that it
makes it hard to write genuinely platform-independent code in the given language.
For this reason, many programming languages (eventually) provide concurrency
primitives and define a memory model guaranteed by the language.
\paragraph{C and C++}
%
The C++11 \mcite{isocpp11draft} and C11 \mcite{isoc11draft} standards
introduced support for threading and atomic operations to C++ and C.
The memory model of both languages is the same, but they differ in the
syntactic ways in which its features are used.\mnote{For example, C++ defines
\cpp{atomic} template class for atomic variables, but C relies on the
\cpp{_Atomic} keyword, as it has no support for (templated) classes.
The C threading API is mostly similar to the API of POSIX threads, while the
C++ threading has a more modern API which uses classes.}
The C++ memory model was revised in the following standards, with the most
notable changes in C++20~\mcite{cpp20}, including modifications to the strongest, sequentially consistent version of atomic operations.
These latest changes mostly remedy problems in interaction between the
sequentially consistent operations and weaker memory orderings presented
in~\mcite{Lahav2017}.
The actual changes are described in~\mcite{P0668R4}.
The C++11 memory model is not formalised in the C++11 standard.
An attempt to formalise it was given in \mcite{cppmemmod}, formalising the
N3092 draft of the standard \mcite{N3092}.
While this formalisation precedes the final C++11 standard, it seems that there were no changes in the specification of atomic operations after N3092.
Nevertheless, there are some differences between the formalisation and N3092
(which are justified in the paper).
The formalisation was later revised in~\mcite{Lahav2017}, which led to the aforementioned revision of concurrency in C++20.
Overall, the C++ memory model is complex, which is largely due to its intention to
allow high performance on various existing or hypothetical future hardware
platforms.
Atomic variables and operations play a central role in the C++ memory model.
Atomic variables are variables of particular types that define atomic operations
such as loads, stores, atomic read-modify-write, and compare-exchange.
For an atomic operation, it is possible to specify the required memory
ordering: C/C++ allows not only sequentially consistent atomic operations but
also weaker (low-level) atomic operations that enable implementation of
efficient parallel data structures in a platform-independent way.
An example of C++ code that leverages atomic variables is shown in \autoref{fig:prel:cppatomic0} and \autoref{fig:prel:cppatomic1}.
\begin{figure}[t]
\begin{cppcode}
std::atomic<int> x;
int y;
\end{cppcode}
\begin{multicols}{2}
\begin{cppcode}
void thr0() {
x.fetch_add(1); // OK
y++; // RACE
}
\end{cppcode}
\columnbreak
\begin{cppcode}
void thr1() {
x++; // also OK
y++; // RACE
}
\end{cppcode}
\end{multicols}
\vspace{-4.5ex}
\caption{A basic example of C++ atomics.
The variable \texttt{x} is atomic, and therefore can be safely used in
concurrent settings (the standard modification operators for integral types
are available and atomic).
On the other hand, the variable \texttt{y} is not atomic and therefore
incrementing it from two threads causes a data race.
}\label{fig:prel:cppatomic0}
\end{figure}
\begin{figure}[t]
\begin{cppcode}
Data data;
std::atomic< bool > ready;
void thr0() {
data.load_data();
ready.store( true, std::memory_order::release );
}
void thr1() {
while ( !ready.load( std::memory_order::relaxed ) )
{ }
std::atomic_thread_fence( std::memory_order::acquire );
data.do_work();
}
\end{cppcode}
\vspace{-2ex}
\caption{A simple example of use of low-level atomic API.
Here \texttt{thr0} loads \texttt{data} somehow and then signals to
\texttt{thr1} that it should be processed.
The \texttt{data} variable itself is not atomic, and presumably, the type
is not made for concurrent access.
An atomic boolean is used to wait for \texttt{data} in \texttt{thr1} --
once \texttt{data} is loaded the \texttt{ready} flag is set, using the
\emph{release} memory ordering.
The actual waiting in \texttt{thr1} uses the weakest \emph{relaxed}
ordering which does not cause any synchronisation (it just ensures
concurrent changes to the variable itself are consistent), but once the
wait ends a fence is executed.
The fence uses \emph{acquire} ordering and therefore completes
\emph{release-acquire} ordering between \texttt{thr0} and \texttt{thr1}
-- anything written by \texttt{thr0} before the \emph{release} store is
available to \texttt{thr1} after the \emph{acquire} fence.
An interesting property of this example is that it compiles with no extra
synchronisation or atomic instructions on x86 (atomic only affects
compiler optimisations there), but uses synchronisation on more relaxed
platforms.
}\label{fig:prel:cppatomic1}
\end{figure}
A notable feature of the C++ memory model is that any program that contains a
data race on a non-atomic variable\mnote{A data race is defined as two accesses
to the same non-atomic variable, at least one of them being a write, which are
not synchronised in a way that prevents them from happening concurrently.} has undefined
behaviour.
Therefore, synchronisation is possible only through atomic variables and
concurrency primitives such as mutexes and condition variables.
\paragraph{Java}
%
The Java memory model is rather different from the C/C++11 one.
Its primary goal is to ensure that programs that cannot observe data races
under sequential consistency will execute as if running under sequential
consistency (the data race free guarantee)~\mcite{javamm_popl_Manson2005}.
The primary means of synchronisation in Java are mutexes (called monitors in Java), synchronised sections of code (which use monitors internally), and volatile variables, which roughly correspond to sequentially consistent atomics in C++11.
Furthermore, as Java strives to be memory safe, it also defines the behaviour of programs with data races.
This behaviour is rather peculiar, as it is primarily concerned with prohibiting out-of-thin-air values -- values which, informally speaking, depend cyclically on themselves.
These values are primarily prohibited to avoid forging pointers to invalid memory or memory which should be otherwise inaccessible to a given thread \mcite{javamm_popl_Manson2005}.
\paragraph{LLVM}
%
The LLVM Intermediate Representation has a memory model derived from the C++
memory model, with the difference that it lacks release-consume ordering and
offers additional \emph{unordered} ordering that does not guarantee atomicity
but makes results of data races defined \mcite{llvm:langref}.
The \emph{unordered} operations are intended to match the semantics of the Java
memory model for shared variables.
\subsection{Approaches to Multi-Threaded Software}
At the highest level, multiple threads of a program can use either
\emph{message passing} or \emph{shared memory} to communicate.\mnote{This can
be generalised to multiple cooperating programs, but the distinction is mostly
technical. With multiple programs, message passing is used more often,
especially in distributed computing.}
While message passing can be done in any reasonably expressive programming
language with support for concurrency, there are programming languages which
adopt it as the primary way of implementing concurrent or distributed programs, for example, Erlang or Go.
In this work, we are focusing on C++\mnote{And by extension to other
general-purpose programming languages with concurrency support, e.g., Java and
C\#.} which has no built-in support for message-passing concurrency; therefore,
we will focus on shared-memory concurrency.
In C++ and similar languages, message passing can be implemented as an
abstraction over shared memory.
\paragraph{Lock-Based Synchronisation}
Synchronisation plays a crucial role in shared memory concurrency.
Concurrent unsynchronised access of two or more threads to the same memory location, with
at least one of the threads writing to it, causes a data race, which can lead
to data corruption and program malfunction.
Lock-based synchronisation (critical sections) is often used as it is
reasonably simple to understand and is often natively supported by
programming languages.\mnote{In C/C++ since the 2011 standards.}
Furthermore, locks usually use operating system primitives to block waiting
threads while another thread executes the critical section, which can increase
the performance of systems with more threads than available processors, as the
waiting threads are inactive and other threads can run in the meantime.
On the other hand, synchronisation can degrade the performance of concurrent
software.
In the extreme case when all interesting work is done in a single critical
section executed by many threads, there will be no gain from parallel execution
and the overhead of threads and their synchronisation will make the program
slower than its sequential version.
For this reason, critical sections should be as short as safely possible and
different areas of shared memory should be protected by different locks so
that the program can access different areas of shared memory concurrently.
This inherently increases the complexity of synchronisation and the risk of data races
or deadlocks -- indeed, deadlocks are easy to avoid with one lock, but with
multiple locks, care must be taken whenever more than one lock is held at a time,
which is a common situation.
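For example, the two-mutex deadlock described earlier can be avoided either by always acquiring the two locks in the same order in both threads, or -- in C++17 and later -- by acquiring them together with \cpp{std::scoped_lock}, which internally applies a deadlock-avoidance algorithm:
\begin{cppcode}
#include <mutex>

std::mutex mtx0, mtx1;

void t0() {
    std::scoped_lock lock( mtx0, mtx1 );   // locks both without deadlocking
    /* ... */
}

void t1() {
    std::scoped_lock lock( mtx1, mtx0 );   // argument order does not matter
    /* ... */
}
\end{cppcode}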
\paragraph{Atomic Access and Lock-Free Programming}
To further improve performance, it is sometimes useful to avoid
operating-system-assisted synchronisation and use atomic operations instead.
These operations can be used to guarantee synchronisation over a word-sized
location of memory.\mnote{The size of the atomically-accessed area is
usually the same as the size of a pointer on the given platform (e.g., 8 bytes on
64-bit platforms), or sometimes twice the size of the pointer (e.g. 16 bytes on
newer x86-64 processors).}
Atomic operations are used to implement \emph{lock-free} algorithms and data
structures which do not require critical sections at all.
In this case, extra care must be taken with the ordering of operations that cannot
be performed atomically due to the size limit of atomic instructions, and the
programmer must carefully evaluate the impact of the memory model on their
algorithm.
An example of (a fragment of) a lock-free data structure can be found in \autoref{fig:prelim:lockfree}.
\begin{widefigure}
\begin{cppcodeln}
void push( const T &x ) {
Node *t;
Node *ltail = tail.load( std::memory_order_acquire );
if ( ltail->write.load( std::memory_order_relaxed )
== ltail->buffer + NodeSize )
t = new Node();
else
t = ltail;
*t->write.load( std::memory_order_relaxed ) = x;
t->write.fetch_add( 1, std::memory_order_release );
if ( ltail != t ) {
ltail->next.store( t, std::memory_order_release );
tail.store( t, std::memory_order_release );
}
}
\end{cppcodeln}
\vspace{-\bigskipamount}
\caption[A push in a lock-free queue]{A push (enqueue) operation of a single-producer-single-consumer
lock-free queue from an older version of DIVINE (modified to use C++11
atomic operations).
The queue uses a linked list of blocks (of type \texttt{Node}).
Each block can hold a fixed number of elements of some type~\texttt{T}.
On lines 2--8 the function checks whether there is room in the last
allocated block and allocates a new block if necessary.
Both the \texttt{tail} and \texttt{write} fields in the \texttt{Node} are
atomic variables, and we explicitly use memory ordering to avoid expensive
synchronisation unnecessary on x86-64 processors.
Then, on line 10, the actual value is written to the node, on the location
pointed to by the \texttt{write} field -- note that the
\texttt{write} pointer is atomic, but the value it points to is not
atomic.
On line 11, we shift the write pointer, which makes the data available to
the consumer thread.
At this point, we need to use release
memory ordering to ensure safety on platforms which are more relaxed than
x86 and to prevent reordering by the compiler.
The release ordering ensures
that any changes performed by the producer thread so far will be available
to the consumer once it performs an acquire load on the \texttt{write}
pointer (to check that the queue is not empty).
Finally, on lines 13--16 we append the linked list of nodes if a new node
was created on line 6.
Again, we use the release memory ordering to ensure all operations
performed so far (in particular the shift of \texttt{write} on line 11) are
visible to the consumer once these pointer stores become visible.
Extending this approach to multiple producers is not
trivial, as the check whether there is space available and the publication of the
written value would need to be performed at the same time.
This would require either modification of two different locations in one