
Commit

Merge pull request #22 from EPCCed/ks-updates1
Various minor improvements
kevinstratford authored Nov 14, 2024
2 parents 2854fbb + 3d54763 commit 12ae223
Showing 11 changed files with 5,404 additions and 6 deletions.
317 changes: 317 additions & 0 deletions images/ks-schematic-host-device-recent.svg
458 changes: 458 additions & 0 deletions images/ks-threads-1d-1block.svg
1,103 changes: 1,103 additions & 0 deletions images/ks-threads-1d-3blocks.svg
3,506 changes: 3,506 additions & 0 deletions images/ks-threads-2d-4blocks.svg
Binary file removed images/ks-threads-blocks-grids.jpeg
Binary file removed images/ks-threads-blocks.jpeg
Binary file removed images/ks-threads.jpeg
16 changes: 15 additions & 1 deletion section-1.01/README.md
@@ -131,14 +131,28 @@ For AMD GPUs, the picture is essentially similar, although some of the
jargon differs.


## Host/device picture
## Host/device (historical) picture

GPUs are typically 'hosted' by a standard CPU, which is responsible
for orchestration of GPU activities. In this context, the CPU and GPU
are often referred to as *host* and *device*, respectively.

![Host/device schematic](../images/ks-schematic-host-device.svg)

There is clearly potential for a bottleneck in transfer of data
between host and device.


A modern configuration may have a single host (a multi-core CPU) hosting
4-8 GPU devices.


## Host/device picture

The most recent hardware has attempted to address the potential
bottleneck in host/device transfer by using a higher bandwidth
"chip-to-chip" connection.

![Host/device schematic](../images/ks-schematic-host-device-recent.svg)

In this model there is typically one CPU associated with one GPU.
6 changes: 3 additions & 3 deletions section-1.02/README.md
Expand Up @@ -14,7 +14,7 @@ organisation of threads.
If we have a one-dimensional problem, e.g., an array, we can assign
individual elements to threads.

![A single thread block in one dimension](../images/ks-threads.jpeg)
![A single thread block in one dimension](../images/ks-threads-1d-1block.svg)

Threads are typically executed in groups of 32, known as a *warp*
(the terminology is borrowed from weaving).
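
For example, a minimal kernel sketch, with illustrative names, in which
each thread of a single block updates one array element:

```
/* Sketch: one block of threads; each thread handles the array element
 * matching its own index within the block (this assumes the array
 * length does not exceed the block size). */

__global__ void scale_elements(double a, double * x) {

  int i = threadIdx.x;    /* index of this thread within the block */

  x[i] = a*x[i];
}

/* Launched with a single block, e.g.:
 * scale_elements<<<1, ARRAY_LENGTH>>>(a, x); */
```
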
@@ -25,7 +25,7 @@ Threads are typically executed in groups of 32, known as a *warp*
Groups of threads are further organised into blocks. In our
one-dimensional picture we may have:

![Threads and blocks in one dimension](../images/ks-threads-blocks.jpeg)
![Threads and blocks in one dimension](../images/ks-threads-1d-3blocks.svg)

Blocks are scheduled to SMs.
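
A sketch of the corresponding kernel indexing (names again illustrative),
where each thread forms a unique global index from its block and thread
indices:

```
/* Sketch: with more than one block, a unique global index is formed
 * from the block index, the block size, and the thread index. */

__global__ void scale_elements(double a, double * x) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  x[i] = a*x[i];
}

/* e.g., three blocks of THREADS_PER_BLOCK threads each:
 * scale_elements<<<3, THREADS_PER_BLOCK>>>(a, x); */
```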

@@ -39,7 +39,7 @@ the maximum number of threads per block is 1024. A value of
For two-dimensional problems (e.g., images) it is natural to have
a two-dimensional Cartesian picture:

![Threads and blocks in two dimensions](../images/ks-threads-blocks-grids.jpeg)
![Threads and blocks in two dimensions](../images/ks-threads-2d-4blocks.svg)

The arrangement of blocks is referred to as the *grid* in CUDA.
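
A sketch of the two-dimensional case (kernel and variable names are
illustrative), with one thread per element of an nx x ny image:

```
/* Sketch: each thread computes a unique (i, j) position from the
 * two-dimensional block and thread indices. */

__global__ void process_image(int nx, int ny, double * image) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;

  if (i < nx && j < ny) image[j*nx + i] = 2.0*image[j*nx + i];
}

/* ... and on the host side, e.g.:
 *
 *   dim3 blocks(4, 4, 1);             -- 4x4 blocks in the grid
 *   dim3 threadsPerBlock(16, 16, 1);  -- 16x16 threads per block
 *
 *   process_image<<<blocks, threadsPerBlock>>>(nx, ny, image);
 */
```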

2 changes: 1 addition & 1 deletion section-2.01/README.md
@@ -92,7 +92,7 @@ via `cudaMemcpy()`. Schematically,
```

These are *blocking* calls: they will not return until the data has been
stored in GPU memory (or and error has occurred).
stored in GPU memory (or an error has occurred).
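
For example, a host-to-device copy might look like the following sketch,
where `d_x`, `h_x` and `nArray` are placeholder names for an existing
device allocation, host array, and element count:

```
/* Sketch: copy nArray doubles from host memory to device memory.
 * The call returns only once the copy has completed. */

cudaError_t err = cudaMemcpy(d_x, h_x, nArray*sizeof(double),
                             cudaMemcpyHostToDevice);
```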

Formally, the API reads
```
2 changes: 1 addition & 1 deletion section-2.03/README.md
@@ -198,7 +198,7 @@ A suggested procedure is:

If we had not used `cudaMemset()` to initialise the device values for
the matrix, what other options to initialise these values on the device
are available to us? (cudaMemset()` is limited in that it can only be
are available to us? (`cudaMemset()` is limited in that it can only be
used to initialise array values to zero, but not to other, non-zero, values.)
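
For instance, one alternative would be to set the values on the host and
copy them across; a sketch with placeholder names:

```
/* Sketch: initialise host values, then transfer them to the device. */

for (int i = 0; i < nElements; i++) {
  h_matrix[i] = 1.0;        /* or whatever non-zero value is required */
}

cudaMemcpy(d_matrix, h_matrix, nElements*sizeof(double),
           cudaMemcpyHostToDevice);

/* A further option is a simple initialisation kernel run on the
 * device itself, which avoids the extra host-device transfer. */
```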

For your best effort for the kernel, what is the overhead of the actual
