todo.txt

o	Julia language as parallel panacea?  https://julialang.org/ and
	https://en.wikipedia.org/wiki/Julia_(programming_language

@@@

o	Move "Why Memory Barriers?" appendix to precede software
	memory-ordering chapter.

	o	Store buffers before cache coherency protocols.

	o	C.1 Cache Structure: Update to 2022.  Check up on
		recent multi-socket systems.

		o	Example cache layout for capacity miss?

		o	Better illustration of the miss examples in the
			text starting with the paragraph starting with
			"The situation depicted in..."

			Maybe use smaller number of sets?  But hexadecimal
			simplifies things.

	o	C.1 Cache Structure: Have subsections on store buffers
		and misordering.  Pull examples from slidesets.

	o	C.2 Cache-Coherence Protocols: "(now IBM)" to
		"(then IBM and now obsolete)" or similar.

	o	C.3 Stores Result in Unnecessary Stalls: Portions
		independent of MESI to C.1.x.

	o	Maybe leave ordering-hostile-architecture section in
		the appendix, and see if shared-store-buffer architecture
		justifies LKMM.

	o	Move 15.1 Ordering Why and How? to new hardware section.
		But leave cheat sheet in chapter 15.

		Ditto 15.2.1 Variables With Multiple Values.

	Maybe just fold parts of Appendix C into the beginnings of
	Chapter 15?

	o	Old single-CPU cacheless systems.

	o	Caches.

	o	Store buffers.  QQ: Didn't store buffers come first?
		A: In some cases, yes.

	o	Invalidation queues.

o	Add a section on RCU design tradeoffs, perhaps extend to
	cover hazard pointers and sequence locking.  Work with Joel
	on this item.

	o	QSBR vs. rcu_read_lock() actually doing something.

		o	Counter-based methods.

		o	Time-based methods.

	o	Encourage holdout readers?

		o	resched_cpu()

		o	Preemption disabling.

		o	Priority boosting.

		o	Enable help from preemption points and
			might-sleep calls.

	o	Special cases for single-CPU non-preemptible situations.

o	Complex data structures:

	o	Just wrap the update in sequence locking.

	o	Issaquah Challenge.

	o	Jon Walpole's and my algorithms (and recent rediscovery).

o	Consider adding kfifo to NBS section.  Perhaps also
	atomic queues and stacks.  Reference hazard pointers.

	LIFO push stack, definitely!!! [DONE]

o	Single writer principle.  Mechanical sympathy.
	("Disruptor" presentation at LCA2013.)

	Martin Thompson: mechanical-sympathy.blogspot.com
	Mike Barker: bad-concurrency.blogspot.com
	github.com/LMAX-Exchange/disruptor

o	Add gcc padding and alignment directives to toolsoftrade.

o	Add RACES paper material, perhaps as another "future" section.
	Also add notes from RACES workshop.

o	In cpu/hwfreelunch.tex, add asynchronous-hardware citation
	from Ivan Sutherland or some such.

o	In intro/intro.tex in the "Use Existing Parallel Software"
	section, add citations for web-application serving and
	map-reduce.

o	Consider refactoring test code in CodeSamples directory tree.
	Probably some opportunity for common measurement code and
	also common argument-parsing code.

o	Add discussion of reader-writer locking to the locking chapter
	implementation discussion.  Ticket rwlocks, brlock/lglock,
	unfair locks (and reasons for them), forward reference to
	seqlock, RCU lock, and barrier lock.

o	Consider an HPC-style barrier section in defer chapter.

o	Consider expanding on dequeue discussion, especially on queues,
	in data-structures chapter (e.g., limits of ordering guarantees).
	Need hash tables there, of course!

o	Data-race detection [Bart Van Assche January 2010]
	KCSAN, Helgrind, DRD, ThreadSanitizer, Intel Thread Checker,
	Intel Parallel Studio, DIOTA, ...

	Valgrind-based tools allow suppression patterns that
	span multiple call stack frames, other tools reportedly
	do not.

	Citation:

		J. Maebe, M. Ronsse, and K. De Bosschere. *DIOTA: Dynamic
		instrumentation, optimization and transformation
		of applications*. In Proceedings of WBT-2002,
		Charlottesville, Virginia, USA, September 2002.

o	Consider adding discussion of Amdahl's Law and Gustafson's
	Law to the introduction.  [Kay Muller]

o	Consider adding something about VLIW and SIMD to hwhabits?
	Maybe something more about memory hierarchy?  [Kay Muller]

o	Need some list somewhere of environments that allow you to
	avoid worrying about locking and threading:  BOINC, map-reduce,
	SQL, sequential-program optimizations, ...

o	Update "SMP Synchronization Design" to cover parallelism in
	general.  Discuss partitioning up front.  Describe two major
	cases, pipelining and data parallelism.  Code locking can be
	analogous to pipelining, but probably needs a separate section.

	o	What is the form of the need for added performance?

		o	Throughput in face of many relatively small
			units of work.	(multiple instances.
			data parallelism.  DBMS.  Web serving.)

		o	Throughput of large block(s) of work.
			(pipelining, or "series".  parallel decompostion,
			or "parallel".)

		o	Latency of units of work that are large relative
			to the communications overhead.  (overlapped
			pipelining, parallel decomposition.)

		o	Latency of an aggregate of many units of work
			that are small relative to the communications
			overhead.  (batch the units of work into larger
			units and then re-ask the question relative to
			the batches)

		o	Latency of units of work that are small relative
			to the communications overhead.  (sequential
			programming optimizations.  full-up redesign.
			approximations/heuristic techniques.
			overclocking.  implement in hardware, e.g.,
			FPGAs.)

	o	Large programs or systems might have different needs
		in different places -- thus might need combination of
		techniques.

	o	Move data-structures discussion to this chapter.

o	Add quote "weak ordering requires strong minds" and dissection
	thereof to memory-ordering discussion.

o	Add section to intro/hwhabits.tex giving intuitive notion of
	how cache coherence protocol and write buffer makes atomic
	instructions expensive.  Draw from URochester slide set
	(in paper/scalability/TR-TMeval or thereabouts).

o	Find Herlihy's topology-analogy stuff and add hints to the
	formal-methods section.

o	Fill out material.

	o	GPGPUs in Data Ownership chapter.

	o	Fill out "defer" section, starting with fork-join examples.
		Relate to state-space complexity.

	o	Add material from "is parallel programming hard" position
		paper.  (After publication.)

	x	Pull in material from LJ2003 RCU and dissertation
		demonstrating locking performance issues.

	o	Need citation for Matlab*p.

	x	Add Steve's and my dyntick paper to analysis as an example
		of simpler examples.

	x	Add stuff from differential profiling paper.

	x	Memory barriers -- get principles lined out, then show
		"bookend" implementations.

	x	Add appendix with mathematics for hard realtime response
		using locks.  Or just cite Bjoern Brandenberg's dissertation.

	x	Create code examples for SMPdesign stuff, fill out
		Hierarchical Locking, Resource Allocation Caches,
		Performance Summary.

		MOSTLY DONE...

	x	Refocus the SMPdesign chapter to partitioning.

	x	Fill out RCU patterns, also corresponding code and
		performance summary.

		x	RCU and atomic list movement (Josh?)

		x	RCU and reference counts

		x	Converting rwlocks to RCU

	x	Consider adding history of RCU from OSR paper (low priority)

	o	Other stuff Paul McKenney has from previous publications.

	o	Other stuff that must be generated or harvested from
		elsewhere.

	o	Note that pthread_mutex permits static initialization.
		(Windows apparently does not.)

	x	Finish whymemorybarriers.tex

o	Come up with suitable API for example code, so that a single
	program can be used for each example, covering the pthreads,
	Linux-kernel-module, and other C-language environments.

	double dgettimeofday(void);
	void spin_lock_init(spinlock_t *sp); /* Must be called. */
	void spin_lock(spinlock_t *sp);
	int spin_trylock(spinlock_t *sp);
	void spin_unlock(spinlock_t *sp);
	thread_id_t create_thread(void *(*func), void *arg);
	void *wait_thread(pthread_id_t tid);

	-- See CodeSamples/pthreads/api-pthreads.h for a start.

	Clearly Java will need its own code.

------------------------------------------------------------------------

x	Add the various livejournal posts on memory order, especially
	the one on transitivity.  Consider also adding Will Deacon's
	January 6 2011 QoS story.

x	Fix the references in defer/rcu*.tex to "green" and "red" cells
	in the various tables.  Change to refer to the symbols actually
	used.

x	Verify versions of cartoons.  (Need whippersnapper in intro)

x	Upgrade Quick Quiz scripts to remove material in second parameter
	when aggregating into "answers" section.  Trivialize by changing
	required format to put closing "}" before end tag.  Then get rid
	of the special cases for code formatting in QuickQuiz answers.

x	Move API description to an appendix.

x	Fill out material.

	x	Finish converting WhyRCU LWN series.

	x	Add cautions to analysis section.  Steve's and
		my "Integrating and Validating dynticks and Preemptable RCU"
		could be a good start, comparing to the rcutree.c
		dynticks implementation.

	x	Also add "what to do instead of parallel programming"
		section.

x	Move the formal-verification sections to an appendix.

x	Add the dyntick verification. (~/paper/KernelJanitor/RCU/dyntick/LWN)

x	Fix rcutorture.h to correctly compute from time.  Allow the
	test time as a third argument.

x	Reference counting: basic problem: make data structure stick
	around while you are incrementing the reference count.

	Strategies: lock, pre-existing reference, atomic check for
		zero reference count coupled with GC/RCU, various
		non-blocking-synchronization tricks.

x	Copy rcutree full data-structure diagram to section describing
	force_quiescent_state() callee traversal of the leaf rcu_node
	structures.  Decorate with right-facing arrow to show search
	direction.  Or perhaps copy diagram that shows initialization
	state.

x	Add \RevisedFrom to origpub.tex to allow something like:

		The ideas discussed in Section... were originally presented
		in XXX\cite{}.

	Allow ranges of sections for OriginallyFrom.

x	Discuss other textbooks in intro.

x	Figure out some way to get section headers between "Answers to
	Quick Quizzes" for different chapters.  Want to avoid any
	section header for a chapter that has no "quick quizzes".
	Not a big deal, nice to have.

x	Add a "Tools of the Trade" chapter before "SMP Synchronization
	Design" giving an overview of how locking and atomic operations
	work.  Need simple fork/join, along with foreshadowing of how
	good design can simplify implementation.

x	Add variables to rest of examples in toyrcu.tex.

x	In the double-ended-queue code, consider renaming the
	"enqueue" operations to "push" and the "dequeue" operations
	to "pop".

x	Need a chapter on counting.

x	Add the separate-thread approach suggested unwittingly by
	Christoph Lameter to the count.tex chapter.

x	Embed fonts from .eps files: http://www.wkiri.com/today/?p=60
	See the sed script:

	cat original_graphics.eps | sed 's+Times-Bold+NimbusSanL-Bold+g' |\
	sed 's+Times-Roman+NimbusSanL-Regu+g' |\
	sed 's+Times+NimbusSanL-Regu+g' |\
	sed 's+Helvetica-BoldOblique+NimbusSanL-BoldItal+g' |\
	sed 's+Helvetica-Oblique+NimbusSanL-ReguItal+g' |\
	sed 's+Helvetica-Bold+NimbusSanL-Bold+g' |\
	sed 's+Helvetica-Bold-iso+NimbusSanL-Bold+g' |\
	sed 's+Helvetica+NimbusSanL-Regu+g' |\
	sed 's+Helvetica-iso+NimbusSanL-Regu+g' |\
	sed 's+Symbol+StandardSymL+g' > new_graphics.eps

	And the additional conversions:

	Courier 	NimbusMonL-Regu
	Courier-Bold 	NimbusMonL-Bold
	Courier-Oblique 	NimbusMonL-ReguObli
	Courier-BoldOblique 	NimbusMonL-BoldObli

x	Change calls to bzero() to memset(). [Horst]

x	dvipdf is quite slow, especially on 1-up setups:

	time dvips perfbook
		real    0m3.405s
		user    0m3.184s
		sys     0m0.224s
	time psnup -2 perfbook.ps perfbook-2up.ps
		real    0m0.852s
		user    0m0.248s
		sys     0m0.308s
	time ps2pdf perfbook-2up.ps perfbook-2up.pdf
		real    0m25.383s
		user    0m24.646s
		sys     0m0.276s

	time dvipdf perfbook
		real    1m31.456s
		user    0m48.347s
		sys     0m45.599s

	Could parallelism be profitably applied here?  ;-)
	Or perhaps just careful avoidance of quadratic algorithms?

	Use pdflatex instead, much faster!

x	Explicitly permit reformatted copies (single-column, ebook, etc.).

x	Change from "preemptable" to "preemptible". [Horst]

x	Produce an HTML version.  One possibility is latex2html
	and scripting, and tamasrepus suggests
	http://johnmacfarlane.net/pandoc/.

	Challenges include the scripting and special latex macros
	that hendle the quick quizzes and the illustration credits.

x	Add discussion of seqlock to defer chapter between refcnt and RCU.
	Use trivial example showing consistency.  Note inapplicability
	to pointer-based data structures.

x	Update chain of RCU implementations to add http://lttng.org/urcu.

x	Line up reviewers.  Need a range of viewpoints, and especially
	a range of familiarity with SMP coding.  [Now that the book is
	public, just let them line themselves up!]

x	Work out who else to invite to contribute.  Possibilities include:
	[Now that the book is public, just let them line themselves up!]

	o	Real-life use of RCU: Dipankar Sarma, Steve Hemminger,
		Maneesh Soni, David Howells, ...

		o	Nick Piggin: radix tree.

		o	Robert Ohlsson: fib trie.

		o	Steve Hemminger: brlock replacement.

		o	SELinux guys.

	o	Extreme scalability: Christoph Lameter, Jesse Barnes,
		other SGI and ex-SGI folks.

	o	Various architecture maintainers for issues on abstract
		CPUs and other hardware issues.

	o	Chinese angle: guys from Tsinghua U. in Beijing.

	o	Stuff from Rusty's Unreliable Guides.

	o	Rusty Commentary version.

	o	Tom Hart on RCU performance measurement.
		Perhaps also on Hazard Pointers (maybe Maged Michael
		as well or instead).

	o	Josh Triplett on RCU error checking in sparse.

	o	Val Henson on academic research on parallel SMP.

	o	http://globaltext.org/ -- mailto:rwatson@terry.uga.edu
		+1-706-542-3706.  "Wiki-based" textbooks.  This is
		the Terry College of Business at U. of Georgia,
		but what is wrong with a little cross-pollenation?

	The usual experience is that somewhere between 20% and 50%
	of the people who get excited about contributing really will
	do so.

x	Need to make a chapter on locking (as opposed to the current
	one on synchronization).  [IN PROGRESS]

x	Consider using parallel maze solving as an example.  See:

	http://www.futurechips.org/tips-for-power-coders/parallel-programming-tutorial-parallelizing-somewhat-complex-code-part-1.html

	A better approach might assign one processor to each end of
	the maze.  Chopping the maze up into blocks might help, but
	this would likely encounter pathological cases with lots
	of boundary crossing.

	Another approach is to start each CPU at a random point in
	the maze (in addition to the one each at start and end).
	Give each CPU a color, and when there is a path from beginning
	to end, you are done.  Possibly an application for parallel
	union-find.  ;-)

x	Update TM section based on OSR "grass is greener" paper and
	relevant paragraph from urcu paper.

x	Move OSR RCU history paper to appendix.

x	Add mechanism to automatically display authorhood.

x	My local view adds a modified book.cls that adds the bibliography
	to the table of contents.  However, the license on book.cls forbids
	distributing it without also distributing latex.  Need to come up
	with a better way!

	In the meantime, edit your local copy of book.cls and remove the
	"*" following the "chapter" directive in "thebibliography" macro.

x	Add support for real 64-bit x86 builds.

x	Use "inkscape --export-pdf=fn.pdf fn.svg" to create PDFs from
	inkscape .svg files.  Alternatively:

		inkscape --export-pdf=fn.pdf --export-latex fn.svg

	gets an fn.pdf_tex file that allows text in the diagram to
	be typeset by Latex, allowing things like references and
	citations to be automatically generated within inkscape drawings.

x	Add Hazard-Pointer section between refcnt and seqlock sections
	of Deferred Processing chapter.

x	Add SNZI to the counting chapter.

x	In the intro/intro.tex quick-quiz answer citing Herb
	Sutter's "Effective Concurrency" column, also cite
	Anthony Williams's book.

x	Put RCUApplicability.eps into the defer chapter.

x	Change "non-idempotent" to "irrevocable" and "idempotent"
	to "revocable" in TM discussion.

x	Add a chapter on performance analysis and benchmarking.

X	Move memory-allocation code from "Partitioning and Synchronization
	Design" to "Data Ownership"?

x	Cite http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
	for reference on CUDA, MPI, etc.

x	Change ASCII double quotes to TeX quotes. [Horst]

x	Use the existence problem as motivation for RCU.  Show some
	solutions in the locking chapter.  See the day-2 ACACES slideset.
	See also the RCU tutorial and the SC22WG14 email.

	o	spin_lock_irqsave(): Critical sections must guarantee forward
		progress against everything except NMI handlers.

	o	raw_spin_lock(): Critical sections must guarantee forward
		progress against everything except IRQ (including softirq)
		and NMI handlers.

	o	spin_lock(): Critical sections must guarantee forward
		progress against everything except IRQ (again including
		softirq) and NMI handlers and (given CONFIG_PREEMPT_RT)
		higher-priority realtime tasks.

	o	mutex_lock(): Critical sections need not guarantee
		forward progress, as general blocking is permitted.

	The other issue is the scope of the lock.  The Linux kernel has
	the following:

	o	BKL: global scope.

	o	Everything else: scope defined by the use of the underlying
		lock variable.

	One of the many reasons that we are trying to get rid of
	BKL is because it combines global scope with relatively weak
	forward-progress guarantees.

x	First edition changes:

	X	Move "Data Structures" to precede "Applying RCU".
		
		Add a tree or two.

	x	Change "Applying RCU" name to "Putting It All Together".

		Add examples from dcache, semaphore mapping, ...

	Publish v0.9-yyyy.mm.dd, call for comments.  Best 3? sets of comments
	get a dead-tree signed first edition.

	Which will be v1.0-yyyy.mm.dd.

x	Consider linking quick quizzes to the answers.

X	Convert bloatwatch RCU and add to appendix/rcuimpl.

X	RCU implementations appendix:

	x	RCU requirements/desiderata.  (See paper on
		desk at home for start, along with the usual
		papers.)  Also look in the advsync/rcu.tex section
		on RCU, which needs to be folded into the
		sections in defer/

	o	Uniprocessor RCU implementation.  Simplify
		the current one so that the rcu and rcu_bh
		implementations are identical, as appropriate
		for a device not subject to networking DoS
		attacks.

	x	Toy implementations.  Follow wikipedia, but add
		reference-count approach.

	x	Performance and scalability for toy versions.

	x	User-level RCU implementations (after linux.conf.au
		2008)

	o	Non-realtime kernel implementations.  Take from
		dissertation and from LWN articles.  Or, given
		treercu, just cite dissertation.

	x	Realtime kernel implementations.

	o	RCU priority boosting.  I suppose getting a good
		boosting implementation would help...

	x	Hierarchical implementation (after November 20, 2008
		due to LWN subscriber-only period).
		Also propagate references into
		appendix/formal/dyntickrcu.tex.

		Also add code walkthrough.  You will need to do
		the walkthrough to find rcutree bugs anyway...

	o	Derive RCU requirements from list of examples,
		propagate list up to defer chapter.
		(rcufundamental.tex)

	o	Crisply present advantages and disadvantages of
		RCU in defer chapter.

x	https://lwn.net/Articles/667593/ to QC futures.  Plus
	https://lwn.net/Articles/667720/.

x	Expant SAT-solver section with recent experience.

x	Move intro/hwhabits.tex to cpu/.  Add sections with increasingly
	detailed pictures of what a CPU does, including speed-of-light
	round-trip time (~3 cm, or a bit more than an inch) at 5GHz.
	3D fabrication might help, but watch the fabrication cost and
	the heat dissipation!!!  Also, electrons travel from 3-30 times
	more slowly in silicon than light does in a vacuum, and even
	more slowly if there are levels of clocked logic in the way.

	Include actual numbers/graphs from real machines.  Show effects
	of various forms of misbehavior.

x	Add Makefile (and comments) to CodeSamples directories.
	(defer and SMPdesign done.)

x	Update Test/perf.gpl set of tests for atomic operations and
	cache-line movement and add to CodeSamples/cpu, then rerun
	this to create a new table of atomic operations.

X	Reorganize memory-ordering introduction with Michel Dagenais's
	feedback in mind.  See email dated August 20, 2012.

x	Add more terms to vocabulary.  More TM stuff, for example.
	Also RCU nomenclature [done].

------------------------------------------------------------------------

PROGRESS

August 5, 2006: 76 pages
August 12, 2006: 84 pages
August 20, 2006: 90 pages
September 3, 2006: 113 pages
September 17, 2006: 123 pages
-- LJ paper, discussions on memory barriers --
October 11, 2006: 111 pages
November 5, 2006: 135 pages
October 20, 2008: 176 pages
October 30, 2008: 199 pages
November 7, 2008: 205 pages
Change page-layout parameters...
November 7, 2008: 168 pages
November 9, 2008: 200 pages
November 16, 2008: 220 pages
November 30, 2008: 232 pages
December 8, 2008: 250 pages
December 23, 2008: 266 pages
January 13, 2009: 276 pages
January 31, 2009: 288 pages
February 28, 2009: 292 pages
May 24, 2009: 302 pages
July 3, 2009: 308 pages
February 21, 2010: 342 pages
February 21, 2010: 354 pages
Change page-layout parameters...
February 20, 2011: 379 pages
March 23, 2011: Test edition posted on lulu.com: http://www.lulu.com/content/paperback-book/is-parallel-programming-hard-and-if-so-what-can-you-do-about-it/10357070