
Commit

Merge pull request #22 from EPCCed/ks-updates1
Various minor improvements
kevinstratford authored Nov 14, 2024
2 parents 2854fbb + 3d54763 commit 12ae223
Showing 11 changed files with 5,404 additions and 6 deletions.
317 changes: 317 additions & 0 deletions images/ks-schematic-host-device-recent.svg
458 changes: 458 additions & 0 deletions images/ks-threads-1d-1block.svg
1,103 changes: 1,103 additions & 0 deletions images/ks-threads-1d-3blocks.svg
3,506 changes: 3,506 additions & 0 deletions images/ks-threads-2d-4blocks.svg
Binary file removed images/ks-threads-blocks-grids.jpeg
Binary file removed images/ks-threads-blocks.jpeg
Binary file removed images/ks-threads.jpeg
16 changes: 15 additions & 1 deletion section-1.01/README.md
@@ -131,14 +131,28 @@ For AMD GPUs, the picture is essentially similar, although some of the
jargon differs.


## Host/device picture
## Host/device (historical) picture

GPUs are typically 'hosted' by a standard CPU, which is responsible
for orchestration of GPU activities. In this context, the CPU and GPU
are often referred to as *host* and *device*, respectively.

![Host/device schematic](../images/ks-schematic-host-device.svg)

There is clearly potential for a bottleneck in transfer of data
between host and device.


A modern configuration may have a single host (a multi-core CPU) hosting
4-8 GPU devices.


## Host/device picture

The most recent hardware has attempted to address the potential
bottleneck in host/device transfer by using a higher bandwidth
"chip-to-chip" connection.

![Host/device schematic](../images/ks-schematic-host-device-recent.svg)

In this model there is typically one CPU associated with one GPU.
6 changes: 3 additions & 3 deletions section-1.02/README.md
Expand Up @@ -14,7 +14,7 @@ organisation of threads.
If we have a one-dimensional problem, e.g., an array, we can assign
individual elements to threads.

![A single thread block in one dimension](../images/ks-threads.jpeg)
![A single thread block in one dimension](../images/ks-threads-1d-1block.svg)

Threads are typically executed in groups of 32, known as a *warp*
(the terminology is borrowed from weaving).
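
For example, a minimal kernel sketch, with illustrative names, in which
each thread of a single block updates one array element:

```
/* Sketch: one block of threads; each thread handles the array element
 * matching its own index within the block (this assumes the array
 * length does not exceed the block size). */

__global__ void scale_elements(double a, double * x) {

  int i = threadIdx.x;    /* index of this thread within the block */

  x[i] = a*x[i];
}

/* Launched with a single block, e.g.:
 * scale_elements<<<1, ARRAY_LENGTH>>>(a, x); */
```
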
@@ -25,7 +25,7 @@ Threads are typically executed in groups of 32, known as a *warp*
Groups of threads are further organised into blocks. In our
one-dimensional picture we may have:

![Threads and blocks in one dimension](../images/ks-threads-blocks.jpeg)
![Threads and blocks in one dimension](../images/ks-threads-1d-3blocks.svg)

Blocks are scheduled to SMs.
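
A sketch of the corresponding kernel indexing (names again illustrative),
where each thread forms a unique global index from its block and thread
indices:

```
/* Sketch: with more than one block, a unique global index is formed
 * from the block index, the block size, and the thread index. */

__global__ void scale_elements(double a, double * x) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  x[i] = a*x[i];
}

/* e.g., three blocks of THREADS_PER_BLOCK threads each:
 * scale_elements<<<3, THREADS_PER_BLOCK>>>(a, x); */
```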

@@ -39,7 +39,7 @@ the maximum number of threads per block is 1024. A value of
For two-dimensional problems (e.g., images) it is natural to have
a two-dimensional Cartesian picture:

![Threads and blocks in two dimensions](../images/ks-threads-blocks-grids.jpeg)
![Threads and blocks in two dimensions](../images/ks-threads-2d-4blocks.svg)

The arrangement of blocks is referred to as the *grid* in CUDA.
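
A sketch of the two-dimensional case (kernel and variable names are
illustrative), with one thread per element of an nx x ny image:

```
/* Sketch: each thread computes a unique (i, j) position from the
 * two-dimensional block and thread indices. */

__global__ void process_image(int nx, int ny, double * image) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;

  if (i < nx && j < ny) image[j*nx + i] = 2.0*image[j*nx + i];
}

/* ... and on the host side, e.g.:
 *
 *   dim3 blocks(4, 4, 1);             -- 4x4 blocks in the grid
 *   dim3 threadsPerBlock(16, 16, 1);  -- 16x16 threads per block
 *
 *   process_image<<<blocks, threadsPerBlock>>>(nx, ny, image);
 */
```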

2 changes: 1 addition & 1 deletion section-2.01/README.md
@@ -92,7 +92,7 @@ via `cudaMemcpy()`. Schematically,
```

These are *blocking* calls: they will not return until the data has been
stored in GPU memory (or and error has occurred).
stored in GPU memory (or an error has occurred).
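
For example, a host-to-device copy might look like the following sketch,
where `d_x`, `h_x` and `nArray` are placeholder names for an existing
device allocation, host array, and element count:

```
/* Sketch: copy nArray doubles from host memory to device memory.
 * The call returns only once the copy has completed. */

cudaError_t err = cudaMemcpy(d_x, h_x, nArray*sizeof(double),
                             cudaMemcpyHostToDevice);
```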

Formally, the API reads
```
2 changes: 1 addition & 1 deletion section-2.03/README.md
@@ -198,7 +198,7 @@ A suggested procedure is:

If we had not used `cudaMemset()` to initialise the device values for
the matrix, what other options to initialise these values on the device
are available to us? (cudaMemset()` is limited in that it can only be
are available to us? (`cudaMemset()` is limited in that it can only be
used to initialise array values to zero, but not to other, non-zero, values.)
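
For instance, one alternative would be to set the values on the host and
copy them across; a sketch with placeholder names:

```
/* Sketch: initialise host values, then transfer them to the device. */

for (int i = 0; i < nElements; i++) {
  h_matrix[i] = 1.0;        /* or whatever non-zero value is required */
}

cudaMemcpy(d_matrix, h_matrix, nElements*sizeof(double),
           cudaMemcpyHostToDevice);

/* A further option is a simple initialisation kernel run on the
 * device itself, which avoids the extra host-device transfer. */
```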

For your best effort for the kernel, what is the overhead of the actual
