forked from paulmckrcu/perfbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
todo.txt
685 lines (472 loc) · 21.3 KB
/
todo.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
o Julia language as parallel panacea? https://julialang.org/ and
https://en.wikipedia.org/wiki/Julia_(programming_language
@@@
o Move "Why Memory Barriers?" appendix to precede software
memory-ordering chapter.
o Store buffers before cache coherency protocols.
o C.1 Cache Structure: Update to 2022. Check up on
recent multi-socket systems.
o Example cache layout for capacity miss?
o Better illustration of the miss examples in the
text starting with the paragraph starting with
"The situation depicted in..."
Maybe use smaller number of sets? But hexadecimal
simplifies things.
o C.1 Cache Structure: Have subsections on store buffers
and misordering. Pull examples from slidesets.
o C.2 Cache-Coherence Protocols: "(now IBM)" to
"(then IBM and now obsolete)" or similar.
o C.3 Stores Result in Unnecessary Stalls: Portions
independent of MESI to C.1.x.
o Maybe leave ordering-hostile-architecture section in
the appendix, and see if shared-store-buffer architecture
justifies LKMM.
o Move 15.1 Ordering Why and How? to new hardware section.
But leave cheat sheet in chapter 15.
Ditto 15.2.1 Variables With Multiple Values.
Maybe just fold parts of Appendix C into the beginnings of
Chapter 15?
o Old single-CPU cacheless systems.
o Caches.
o Store buffers. QQ: Didn't store buffers come first?
A: In some cases, yes.
o Invalidation queues.
o Add a section on RCU design tradeoffs, perhaps extend to
cover hazard pointers and sequence locking. Work with Joel
on this item.
o QSBR vs. rcu_read_lock() actually doing something.
o Counter-based methods.
o Time-based methods.
o Encourage holdout readers?
o resched_cpu()
o Preemption disabling.
o Priority boosting.
o Enable help from preemption points and
might-sleep calls.
o Special cases for single-CPU non-preemptible situations.
o Complex data structures:
o Just wrap the update in sequence locking.
o Issaquah Challenge.
o Jon Walpole's and my algorithms (and recent rediscovery).
o Consider adding kfifo to NBS section. Perhaps also
atomic queues and stacks. Reference hazard pointers.
LIFO push stack, definitely!!! [DONE]
o Single writer principle. Mechanical sympathy.
("Disruptor" presentation at LCA2013.)
Martin Thompson: mechanical-sympathy.blogspot.com
Mike Barker: bad-concurrency.blogspot.com
github.com/LMAX-Exchange/disruptor
o Add gcc padding and alignment directives to toolsoftrade.
o Add RACES paper material, perhaps as another "future" section.
Also add notes from RACES workshop.
o In cpu/hwfreelunch.tex, add asynchronous-hardware citation
from Ivan Sutherland or some such.
o In intro/intro.tex in the "Use Existing Parallel Software"
section, add citations for web-application serving and
map-reduce.
o Consider refactoring test code in CodeSamples directory tree.
Probably some opportunity for common measurement code and
also common argument-parsing code.
o Add discussion of reader-writer locking to the locking chapter
implementation discussion. Ticket rwlocks, brlock/lglock,
unfair locks (and reasons for them), forward reference to
seqlock, RCU lock, and barrier lock.
o Consider an HPC-style barrier section in defer chapter.
o Consider expanding on dequeue discussion, especially on queues,
in data-structures chapter (e.g., limits of ordering guarantees).
Need hash tables there, of course!
o Data-race detection [Bart Van Assche January 2010]
KCSAN, Helgrind, DRD, ThreadSanitizer, Intel Thread Checker,
Intel Parallel Studio, DIOTA, ...
Valgrind-based tools allow suppression patterns that
span multiple call stack frames, other tools reportedly
do not.
Citation:
J. Maebe, M. Ronsse, and K. De Bosschere. *DIOTA: Dynamic
instrumentation, optimization and transformation
of applications*. In Proceedings of WBT-2002,
Charlottesville, Virginia, USA, September 2002.
o Consider adding discussion of Amdahl's Law and Gustafson's
Law to the introduction. [Kay Muller]
o Consider adding something about VLIW and SIMD to hwhabits?
Maybe something more about memory hierarchy? [Kay Muller]
o Need some list somewhere of environments that allow you to
avoid worrying about locking and threading: BOINC, map-reduce,
SQL, sequential-program optimizations, ...
o Update "SMP Synchronization Design" to cover parallelism in
general. Discuss partitioning up front. Describe two major
cases, pipelining and data parallelism. Code locking can be
analogous to pipelining, but probably needs a separate section.
o What is the form of the need for added performance?
o Throughput in face of many relatively small
units of work. (multiple instances.
data parallelism. DBMS. Web serving.)
o Throughput of large block(s) of work.
(pipelining, or "series". parallel decompostion,
or "parallel".)
o Latency of units of work that are large relative
to the communications overhead. (overlapped
pipelining, parallel decomposition.)
o Latency of an aggregate of many units of work
that are small relative to the communications
overhead. (batch the units of work into larger
units and then re-ask the question relative to
the batches)
o Latency of units of work that are small relative
to the communications overhead. (sequential
programming optimizations. full-up redesign.
approximations/heuristic techniques.
overclocking. implement in hardware, e.g.,
FPGAs.)
o Large programs or systems might have different needs
in different places -- thus might need combination of
techniques.
o Move data-structures discussion to this chapter.
o Add quote "weak ordering requires strong minds" and dissection
thereof to memory-ordering discussion.
o Add section to intro/hwhabits.tex giving intuitive notion of
how cache coherence protocol and write buffer makes atomic
instructions expensive. Draw from URochester slide set
(in paper/scalability/TR-TMeval or thereabouts).
o Find Herlihy's topology-analogy stuff and add hints to the
formal-methods section.
o Fill out material.
o GPGPUs in Data Ownership chapter.
o Fill out "defer" section, starting with fork-join examples.
Relate to state-space complexity.
o Add material from "is parallel programming hard" position
paper. (After publication.)
x Pull in material from LJ2003 RCU and dissertation
demonstrating locking performance issues.
o Need citation for Matlab*p.
x Add Steve's and my dyntick paper to analysis as an example
of simpler examples.
x Add stuff from differential profiling paper.
x Memory barriers -- get principles lined out, then show
"bookend" implementations.
x Add appendix with mathematics for hard realtime response
using locks. Or just cite Bjoern Brandenberg's dissertation.
x Create code examples for SMPdesign stuff, fill out
Hierarchical Locking, Resource Allocation Caches,
Performance Summary.
MOSTLY DONE...
x Refocus the SMPdesign chapter to partitioning.
x Fill out RCU patterns, also corresponding code and
performance summary.
x RCU and atomic list movement (Josh?)
x RCU and reference counts
x Converting rwlocks to RCU
x Consider adding history of RCU from OSR paper (low priority)
o Other stuff Paul McKenney has from previous publications.
o Other stuff that must be generated or harvested from
elsewhere.
o Note that pthread_mutex permits static initialization.
(Windows apparently does not.)
x Finish whymemorybarriers.tex
o Come up with suitable API for example code, so that a single
program can be used for each example, covering the pthreads,
Linux-kernel-module, and other C-language environments.
double dgettimeofday(void);
void spin_lock_init(spinlock_t *sp); /* Must be called. */
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);
thread_id_t create_thread(void *(*func), void *arg);
void *wait_thread(pthread_id_t tid);
-- See CodeSamples/pthreads/api-pthreads.h for a start.
Clearly Java will need its own code.
------------------------------------------------------------------------
x Add the various livejournal posts on memory order, especially
the one on transitivity. Consider also adding Will Deacon's
January 6 2011 QoS story.
x Fix the references in defer/rcu*.tex to "green" and "red" cells
in the various tables. Change to refer to the symbols actually
used.
x Verify versions of cartoons. (Need whippersnapper in intro)
x Upgrade Quick Quiz scripts to remove material in second parameter
when aggregating into "answers" section. Trivialize by changing
required format to put closing "}" before end tag. Then get rid
of the special cases for code formatting in QuickQuiz answers.
x Move API description to an appendix.
x Fill out material.
x Finish converting WhyRCU LWN series.
x Add cautions to analysis section. Steve's and
my "Integrating and Validating dynticks and Preemptable RCU"
could be a good start, comparing to the rcutree.c
dynticks implementation.
x Also add "what to do instead of parallel programming"
section.
x Move the formal-verification sections to an appendix.
x Add the dyntick verification. (~/paper/KernelJanitor/RCU/dyntick/LWN)
x Fix rcutorture.h to correctly compute from time. Allow the
test time as a third argument.
x Reference counting: basic problem: make data structure stick
around while you are incrementing the reference count.
Strategies: lock, pre-existing reference, atomic check for
zero reference count coupled with GC/RCU, various
non-blocking-synchronization tricks.
x Copy rcutree full data-structure diagram to section describing
force_quiescent_state() callee traversal of the leaf rcu_node
structures. Decorate with right-facing arrow to show search
direction. Or perhaps copy diagram that shows initialization
state.
x Add \RevisedFrom to origpub.tex to allow something like:
The ideas discussed in Section... were originally presented
in XXX\cite{}.
Allow ranges of sections for OriginallyFrom.
x Discuss other textbooks in intro.
x Figure out some way to get section headers between "Answers to
Quick Quizzes" for different chapters. Want to avoid any
section header for a chapter that has no "quick quizzes".
Not a big deal, nice to have.
x Add a "Tools of the Trade" chapter before "SMP Synchronization
Design" giving an overview of how locking and atomic operations
work. Need simple fork/join, along with foreshadowing of how
good design can simplify implementation.
x Add variables to rest of examples in toyrcu.tex.
x In the double-ended-queue code, consider renaming the
"enqueue" operations to "push" and the "dequeue" operations
to "pop".
x Need a chapter on counting.
x Add the separate-thread approach suggested unwittingly by
Christoph Lameter to the count.tex chapter.
x Embed fonts from .eps files: http://www.wkiri.com/today/?p=60
See the sed script:
cat original_graphics.eps | sed 's+Times-Bold+NimbusSanL-Bold+g' |\
sed 's+Times-Roman+NimbusSanL-Regu+g' |\
sed 's+Times+NimbusSanL-Regu+g' |\
sed 's+Helvetica-BoldOblique+NimbusSanL-BoldItal+g' |\
sed 's+Helvetica-Oblique+NimbusSanL-ReguItal+g' |\
sed 's+Helvetica-Bold+NimbusSanL-Bold+g' |\
sed 's+Helvetica-Bold-iso+NimbusSanL-Bold+g' |\
sed 's+Helvetica+NimbusSanL-Regu+g' |\
sed 's+Helvetica-iso+NimbusSanL-Regu+g' |\
sed 's+Symbol+StandardSymL+g' > new_graphics.eps
And the additional conversions:
Courier NimbusMonL-Regu
Courier-Bold NimbusMonL-Bold
Courier-Oblique NimbusMonL-ReguObli
Courier-BoldOblique NimbusMonL-BoldObli
x Change calls to bzero() to memset(). [Horst]
x dvipdf is quite slow, especially on 1-up setups:
time dvips perfbook
real 0m3.405s
user 0m3.184s
sys 0m0.224s
time psnup -2 perfbook.ps perfbook-2up.ps
real 0m0.852s
user 0m0.248s
sys 0m0.308s
time ps2pdf perfbook-2up.ps perfbook-2up.pdf
real 0m25.383s
user 0m24.646s
sys 0m0.276s
time dvipdf perfbook
real 1m31.456s
user 0m48.347s
sys 0m45.599s
Could parallelism be profitably applied here? ;-)
Or perhaps just careful avoidance of quadratic algorithms?
Use pdflatex instead, much faster!
x Explicitly permit reformatted copies (single-column, ebook, etc.).
x Change from "preemptable" to "preemptible". [Horst]
x Produce an HTML version. One possibility is latex2html
and scripting, and tamasrepus suggests
http://johnmacfarlane.net/pandoc/.
Challenges include the scripting and special latex macros
that hendle the quick quizzes and the illustration credits.
x Add discussion of seqlock to defer chapter between refcnt and RCU.
Use trivial example showing consistency. Note inapplicability
to pointer-based data structures.
x Update chain of RCU implementations to add http://lttng.org/urcu.
x Line up reviewers. Need a range of viewpoints, and especially
a range of familiarity with SMP coding. [Now that the book is
public, just let them line themselves up!]
x Work out who else to invite to contribute. Possibilities include:
[Now that the book is public, just let them line themselves up!]
o Real-life use of RCU: Dipankar Sarma, Steve Hemminger,
Maneesh Soni, David Howells, ...
o Nick Piggin: radix tree.
o Robert Ohlsson: fib trie.
o Steve Hemminger: brlock replacement.
o SELinux guys.
o Extreme scalability: Christoph Lameter, Jesse Barnes,
other SGI and ex-SGI folks.
o Various architecture maintainers for issues on abstract
CPUs and other hardware issues.
o Chinese angle: guys from Tsinghua U. in Beijing.
o Stuff from Rusty's Unreliable Guides.
o Rusty Commentary version.
o Tom Hart on RCU performance measurement.
Perhaps also on Hazard Pointers (maybe Maged Michael
as well or instead).
o Josh Triplett on RCU error checking in sparse.
o Val Henson on academic research on parallel SMP.
o http://globaltext.org/ -- mailto:[email protected]
+1-706-542-3706. "Wiki-based" textbooks. This is
the Terry College of Business at U. of Georgia,
but what is wrong with a little cross-pollenation?
The usual experience is that somewhere between 20% and 50%
of the people who get excited about contributing really will
do so.
x Need to make a chapter on locking (as opposed to the current
one on synchronization). [IN PROGRESS]
x Consider using parallel maze solving as an example. See:
http://www.futurechips.org/tips-for-power-coders/parallel-programming-tutorial-parallelizing-somewhat-complex-code-part-1.html
A better approach might assign one processor to each end of
the maze. Chopping the maze up into blocks might help, but
this would likely encounter pathological cases with lots
of boundary crossing.
Another approach is to start each CPU at a random point in
the maze (in addition to the one each at start and end).
Give each CPU a color, and when there is a path from beginning
to end, you are done. Possibly an application for parallel
union-find. ;-)
x Update TM section based on OSR "grass is greener" paper and
relevant paragraph from urcu paper.
x Move OSR RCU history paper to appendix.
x Add mechanism to automatically display authorhood.
x My local view adds a modified book.cls that adds the bibliography
to the table of contents. However, the license on book.cls forbids
distributing it without also distributing latex. Need to come up
with a better way!
In the meantime, edit your local copy of book.cls and remove the
"*" following the "chapter" directive in "thebibliography" macro.
x Add support for real 64-bit x86 builds.
x Use "inkscape --export-pdf=fn.pdf fn.svg" to create PDFs from
inkscape .svg files. Alternatively:
inkscape --export-pdf=fn.pdf --export-latex fn.svg
gets an fn.pdf_tex file that allows text in the diagram to
be typeset by Latex, allowing things like references and
citations to be automatically generated within inkscape drawings.
x Add Hazard-Pointer section between refcnt and seqlock sections
of Deferred Processing chapter.
x Add SNZI to the counting chapter.
x In the intro/intro.tex quick-quiz answer citing Herb
Sutter's "Effective Concurrency" column, also cite
Anthony Williams's book.
x Put RCUApplicability.eps into the defer chapter.
x Change "non-idempotent" to "irrevocable" and "idempotent"
to "revocable" in TM discussion.
x Add a chapter on performance analysis and benchmarking.
X Move memory-allocation code from "Partitioning and Synchronization
Design" to "Data Ownership"?
x Cite http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
for reference on CUDA, MPI, etc.
x Change ASCII double quotes to TeX quotes. [Horst]
x Use the existence problem as motivation for RCU. Show some
solutions in the locking chapter. See the day-2 ACACES slideset.
See also the RCU tutorial and the SC22WG14 email.
o spin_lock_irqsave(): Critical sections must guarantee forward
progress against everything except NMI handlers.
o raw_spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (including softirq)
and NMI handlers.
o spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (again including
softirq) and NMI handlers and (given CONFIG_PREEMPT_RT)
higher-priority realtime tasks.
o mutex_lock(): Critical sections need not guarantee
forward progress, as general blocking is permitted.
The other issue is the scope of the lock. The Linux kernel has
the following:
o BKL: global scope.
o Everything else: scope defined by the use of the underlying
lock variable.
One of the many reasons that we are trying to get rid of
BKL is because it combines global scope with relatively weak
forward-progress guarantees.
x First edition changes:
X Move "Data Structures" to precede "Applying RCU".
Add a tree or two.
x Change "Applying RCU" name to "Putting It All Together".
Add examples from dcache, semaphore mapping, ...
Publish v0.9-yyyy.mm.dd, call for comments. Best 3? sets of comments
get a dead-tree signed first edition.
Which will be v1.0-yyyy.mm.dd.
x Consider linking quick quizzes to the answers.
X Convert bloatwatch RCU and add to appendix/rcuimpl.
X RCU implementations appendix:
x RCU requirements/desiderata. (See paper on
desk at home for start, along with the usual
papers.) Also look in the advsync/rcu.tex section
on RCU, which needs to be folded into the
sections in defer/
o Uniprocessor RCU implementation. Simplify
the current one so that the rcu and rcu_bh
implementations are identical, as appropriate
for a device not subject to networking DoS
attacks.
x Toy implementations. Follow wikipedia, but add
reference-count approach.
x Performance and scalability for toy versions.
x User-level RCU implementations (after linux.conf.au
2008)
o Non-realtime kernel implementations. Take from
dissertation and from LWN articles. Or, given
treercu, just cite dissertation.
x Realtime kernel implementations.
o RCU priority boosting. I suppose getting a good
boosting implementation would help...
x Hierarchical implementation (after November 20, 2008
due to LWN subscriber-only period).
Also propagate references into
appendix/formal/dyntickrcu.tex.
Also add code walkthrough. You will need to do
the walkthrough to find rcutree bugs anyway...
o Derive RCU requirements from list of examples,
propagate list up to defer chapter.
(rcufundamental.tex)
o Crisply present advantages and disadvantages of
RCU in defer chapter.
x https://lwn.net/Articles/667593/ to QC futures. Plus
https://lwn.net/Articles/667720/.
x Expant SAT-solver section with recent experience.
x Move intro/hwhabits.tex to cpu/. Add sections with increasingly
detailed pictures of what a CPU does, including speed-of-light
round-trip time (~3 cm, or a bit more than an inch) at 5GHz.
3D fabrication might help, but watch the fabrication cost and
the heat dissipation!!! Also, electrons travel from 3-30 times
more slowly in silicon than light does in a vacuum, and even
more slowly if there are levels of clocked logic in the way.
Include actual numbers/graphs from real machines. Show effects
of various forms of misbehavior.
x Add Makefile (and comments) to CodeSamples directories.
(defer and SMPdesign done.)
x Update Test/perf.gpl set of tests for atomic operations and
cache-line movement and add to CodeSamples/cpu, then rerun
this to create a new table of atomic operations.
X Reorganize memory-ordering introduction with Michel Dagenais's
feedback in mind. See email dated August 20, 2012.
x Add more terms to vocabulary. More TM stuff, for example.
Also RCU nomenclature [done].
------------------------------------------------------------------------
PROGRESS
August 5, 2006: 76 pages
August 12, 2006: 84 pages
August 20, 2006: 90 pages
September 3, 2006: 113 pages
September 17, 2006: 123 pages
-- LJ paper, discussions on memory barriers --
October 11, 2006: 111 pages
November 5, 2006: 135 pages
October 20, 2008: 176 pages
October 30, 2008: 199 pages
November 7, 2008: 205 pages
Change page-layout parameters...
November 7, 2008: 168 pages
November 9, 2008: 200 pages
November 16, 2008: 220 pages
November 30, 2008: 232 pages
December 8, 2008: 250 pages
December 23, 2008: 266 pages
January 13, 2009: 276 pages
January 31, 2009: 288 pages
February 28, 2009: 292 pages
May 24, 2009: 302 pages
July 3, 2009: 308 pages
February 21, 2010: 342 pages
February 21, 2010: 354 pages
Change page-layout parameters...
February 20, 2011: 379 pages
March 23, 2011: Test edition posted on lulu.com: http://www.lulu.com/content/paperback-book/is-parallel-programming-hard-and-if-so-what-can-you-do-about-it/10357070