<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="../common-revealjs/css/reveal.css">
<link rel="stylesheet" href="../common-revealjs/css/theme/white.css">
<link rel="stylesheet" href="../common-revealjs/css/custom.css">
<script>
// This is needed when printing the slides to pdf
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../common-revealjs/css/print/pdf.css' : '../common-revealjs/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<script>
// This is used to display the static images on each slide,
// See global-images in this html file and custom.css
(function() {
if(window.addEventListener) {
window.addEventListener('load', () => {
let slides = document.getElementsByClassName("slide-background");
if (slides.length === 0) {
slides = document.getElementsByClassName("pdf-page")
}
// Insert global images on each slide
for(let i = 0, max = slides.length; i < max; i++) {
let cln = document.getElementById("global-images").cloneNode(true);
cln.removeAttribute("id");
slides[i].appendChild(cln);
}
// Remove top level global images
let elem = document.getElementById("global-images");
elem.parentElement.removeChild(elem);
}, false);
}
})();
</script>
</head>
<body>
<div class="reveal">
<div class="slides">
<div id="global-images" class="global-images">
<img src="../common-revealjs/images/sycl_academy.png" />
<img src="../common-revealjs/images/sycl_logo.png" />
<img src="../common-revealjs/images/trademarks.png" />
</div>
<!--Slide 1-->
<section class="hbox">
<div class="hbox" data-markdown>
## Data and Dependencies
</div>
</section>
<!--Slide 2-->
<section class="hbox" data-markdown>
## Learning Objectives
* Learn how to create dependencies between kernel functions
* Learn how to move data between the host and device(s)
* Learn about the differences between the buffer/accessor and USM data management models
* Learn how to represent basic data flow graphs
</section>
<!--Slide 3-->
<section>
<div class="hbox" data-markdown>
#### Access/buffer and USM
</div>
<div class="container" data-markdown>
There are two ways to move data and create dependencies between kernel functions in SYCL
</div>
<div class="container">
<div class="col" data-markdown>
Buffer/accessor data movement model
<br/>
* Data dependencies analysis
* Implicit data movement
</div>
<div class="col" data-markdown>
USM data movement model
<br/>
* Manual chaining of dependencies
* Explicit data movement
</div>
</div>
</section>
<!--Slide 4-->
<section>
<div class="hbox" data-markdown>
#### Creating dependencies
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/data_dependency.png "SYCL")
</div>
<div class="col" data-markdown>
* Kernel A first writes to the data
* Kernel B then reads from and writes to the data
* This creates a read-after-write (RAW) relationship
* There must be a dependency created between Kernel A and Kernel B
</div>
</div>
</section>
<!--Slide 5-->
<section>
<div class="hbox" data-markdown>
#### Moving data
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/data_movement.png "SYCL")
</div>
<div class="col" data-markdown>
* Here both kernel functions are enqueued to the same device, in this case a GPU
* The data must be copied to the GPU before Kernel A is executed
* The data must remain on the GPU for Kernel B to be executed
* The data must be copied back to the host after Kernel B has executed
</div>
</div>
</section>
<!--Slide 6-->
<section>
<div class="hbox" data-markdown>
#### Data flow
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/data_flow.png "SYCL")
</div>
<div class="col" data-markdown>
* Combining the kernel function dependencies with the data movement dependencies gives the final data flow graph
* This graph defines the order in which all commands must execute in order to maintain consistency
* In more complex data flow graphs there may be multiple orderings which can achieve the same consistency
</div>
</div>
</section>
<!--Slide 7-->
<section>
<div class="hbox" data-markdown>
#### Data flow with buffers and accessors
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
sycl::buffer buf {data, sycl::range{1024}};
gpuQueue.submit([&](sycl::handler &cgh) {
sycl::accessor acc {buf, cgh};
cgh.parallel_for<kernel_a>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.submit([&](sycl::handler &cgh) {
sycl::accessor acc{buf, cgh};
cgh.parallel_for<kernel_b>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.wait();
</pre></code>
</div>
<div class="col" data-markdown>
* The buffer/accessor data management model is descriptive
* Dependencies and data movement are inferred from the access requirements of command groups
* The SYCL runtime is responsible for guaranteeing that data dependencies and consistency are maintained
</div>
</div>
</section>
<!--Slide 8-->
<section>
<div class="hbox" data-markdown>
#### Data flow with buffers and accessors
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
<mark>sycl::buffer buf {data, sycl::range{1024}};</mark>
gpuQueue.submit([&](sycl::handler &cgh) {
sycl::accessor acc {buf, cgh};
cgh.parallel_for<kernel_a>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.submit([&](sycl::handler &cgh) {
sycl::accessor acc {buf, cgh};
cgh.parallel_for<kernel_b>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.wait();
</pre></code>
</div>
<div class="col" data-markdown>
* A `buffer` object is responsible for managing data between the host and one or more devices
* It is also responsible for tracking dependencies on the data it manages
* It will also allocate memory and move data when necessary
* Note that a `buffer` is lazy and will not allocate or move data until it is asked to
</div>
</div>
</section>
<!--Slide 9-->
<section>
<div class="hbox" data-markdown>
#### Data flow with buffers and accessors
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
sycl::buffer buf {data, sycl::range{1024}};
gpuQueue.submit([&](sycl::handler &cgh) {
<mark>sycl::accessor acc{buf, cgh};</mark>
cgh.parallel_for<kernel_a>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.submit([&](sycl::handler &cgh) {
<mark>sycl::accessor acc{buf, cgh};</mark>
cgh.parallel_for<kernel_b>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.wait();
</pre></code>
</div>
<div class="col" data-markdown>
* An `accessor` object is responsible for describing data access requirements
* It describes what data a kernel function is accessing and how it is accessing it
* The `buffer` object uses this information to infer dependencies and data movement
</div>
</div>
</section>
<!--Slide 10-->
<section>
<div class="hbox" data-markdown>
#### Data flow with buffers and accessors
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
sycl::buffer buf {data, sycl::range{1024}};
gpuQueue.submit([&](sycl::handler &cgh) {
sycl::accessor acc {buf, <mark>cgh</mark>};
cgh.parallel_for<kernel_a>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.submit([&](sycl::handler &cgh) {
sycl::accessor acc {buf, <mark>cgh</mark>};
cgh.parallel_for<kernel_b>(sycl::range{1024},
[=](sycl::id<1> idx) {
acc[idx] = /* some computation */
});
});
gpuQueue.wait();
</pre></code>
</div>
<div class="col" data-markdown>
* Associating the `accessor` object with the `handler` connects the access dependency to the kernel function
* It also associates the access requirement with the device being targeted
</div>
</div>
</section>
<!--Slide 11-->
<section>
<div class="hbox" data-markdown>
#### Data flow with USM
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
auto devicePtr =
sycl::malloc_device<int>(1024, gpuQueue);
auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);
auto e2 = gpuQueue.parallel_for<kernel_a>(
sycl::range{1024}, e1, [=](sycl::id<1> idx) {
devicePtr[idx] = /* some computation */
});
auto e3 = gpuQueue.parallel_for<kernel_b>(
sycl::range{1024}, e2, [=](sycl::id<1> idx) {
devicePtr[idx] = /* some computation */
});
auto e4 = gpuQueue.memcpy(data, devicePtr,
sizeof(int) * 1024, e3);
e4.wait();
sycl::free(devicePtr, gpuQueue);
</pre></code>
</div>
<div class="col" data-markdown>
* The USM data management model is prescriptive
* Dependencies are defined explicitly by passing around `event` objects
* Data movement is performed explicitly by enqueuing `memcpy` operations
* The user is responsible for ensuring data dependencies and consistency are maintained
</div>
</div>
</section>
<!--Slide 12-->
<section>
<div class="hbox" data-markdown>
#### Data flow with USM
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
auto devicePtr =
sycl::malloc_device<int>(1024, gpuQueue);
<mark>auto e1</mark> = gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);
<mark>auto e2</mark> = gpuQueue.parallel_for<kernel_a>(
sycl::range{1024}, <mark>e1</mark>, [=](sycl::id<1> idx) {
devicePtr[idx] = /* some computation */
});
<mark>auto e3</mark> = gpuQueue.parallel_for<kernel_b>(
sycl::range{1024}, <mark>e2</mark>, [=](sycl::id<1> idx) {
devicePtr[idx] = /* some computation */
});
<mark>auto e4</mark> = gpuQueue.memcpy(data, devicePtr,
sizeof(int) * 1024, <mark>e3</mark>);
<mark>e4.wait();</mark>
sycl::free(devicePtr, gpuQueue);
</pre></code>
</div>
<div class="col" data-markdown>
* Each command enqueued to the `queue` produces an `event` object which can be used to synchronize with the completion of that command
* Passing those `event` objects when enqueueing other commands creates dependencies
</div>
</div>
</section>
<!--Slide 13-->
<section>
<div class="hbox" data-markdown>
#### Data flow with USM
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
auto devicePtr =
sycl::malloc_device<int>(1024, gpuQueue);
<mark>auto e1 = gpuQueue.memcpy(devicePtr, data, sizeof(int) * 1024);</mark>
auto e2 = gpuQueue.parallel_for<kernel_a>(
sycl::range{1024}, e1, [=](sycl::id<1> idx) {
devicePtr[idx] = /* some computation */
});
auto e3 = gpuQueue.parallel_for<kernel_b>(
sycl::range{1024}, e2, [=](sycl::id<1> idx) {
devicePtr[idx] = /* some computation */
});
auto e4 = gpuQueue.memcpy(data, devicePtr,
sizeof(int) * 1024, e3);
e4.wait();
sycl::free(devicePtr, gpuQueue);
</pre></code>
</div>
<div class="col" data-markdown>
* The `memcpy` member functions are used to enqueue data movement commands, moving the data to the GPU and then back again
</div>
</div>
</section>
<!--Slide 14-->
<section>
<div class="hbox" data-markdown>
#### Concurrent data flow
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/concurrent_data_flow.png "SYCL")
</div>
<div class="col" data-markdown>
* If two kernels are accessing different buffers then there is no dependency between them
* In this case the two kernels and their respective data movement are independent
* By default `queue`s are out-of-order which means that these commands can execute in any order
* They could also execute concurrently if the target device is able to do so
</div>
</div>
</section>
<!--Slide 15-->
<section>
<div class="hbox" data-markdown>
#### Concurrent data flow with buffers and accessors
</div>
<div class="container">
<div class="col">
<code class="code-100pc"><pre>
sycl::buffer bufA {dataA, sycl::range{1024}};
sycl::buffer bufB {dataB, sycl::range{1024}};
gpuQueue.submit([&](sycl::handler &cgh) {
<mark>sycl::accessor accA {bufA, cgh};</mark>
cgh.parallel_for<kernel_a>(sycl::range{1024},
[=](sycl::id<1> idx) {
accA[idx] = /* some computation */
});
});
gpuQueue.submit([&](sycl::handler &cgh) {
<mark>sycl::accessor accB {bufB, cgh};</mark>
cgh.parallel_for<kernel_b>(sycl::range{1024},
[=](sycl::id<1> idx) {
accB[idx] = /* some computation */
});
});
gpuQueue.wait();
</pre></code>
</div>
<div class="col" data-markdown>
* The buffer/accessor data management model automatically infers dependencies
* As each of the two kernel functions accesses a different `buffer` object, the SYCL runtime can infer there is no dependency between them
* Data movement is still performed for the two kernels as normal
* The two kernels and their respective copies can execute in any order
</div>
</div>
</section>
<!--Slide 16-->
<section>
<div class="hbox" data-markdown>
#### Concurrent data flow with USM
</div>
<div class="container">
<code class="code-100pc"><pre>
auto devicePtrA = sycl::malloc_device<int>(1024, gpuQueue);
auto devicePtrB = sycl::malloc_device<int>(1024, gpuQueue);
<mark>auto e1</mark> = gpuQueue.memcpy(devicePtrA, dataA, sizeof(int) * 1024);
auto e2 = gpuQueue.memcpy(devicePtrB, dataB, sizeof(int) * 1024);
<mark>auto e3</mark> = gpuQueue.parallel_for<kernel_a>(sycl::range{1024}, <mark>e1</mark>, [=](sycl::id<1> idx) {
devicePtrA[idx] = /* some computation */ });
auto e4 = gpuQueue.parallel_for<kernel_b>(sycl::range{1024}, e2, [=](sycl::id<1> idx) {
devicePtrB[idx] = /* some computation */ });
<mark>auto e5</mark> = gpuQueue.memcpy(dataA, devicePtrA, sizeof(int) * 1024, <mark>e3</mark>);
auto e6 = gpuQueue.memcpy(dataB, devicePtrB, sizeof(int) * 1024, e4);
<mark>e5.wait();</mark> e6.wait();
sycl::free(devicePtrA, gpuQueue);
sycl::free(devicePtrB, gpuQueue);
</pre></code>
</div>
<div class="container" data-markdown>
* Dependencies are defined explicitly
* There are no dependencies between the two kernel functions, but each kernel does depend on its respective data movement commands
</div>
</section>
<!--Slide 17-->
<section>
<div class="hbox" data-markdown>
#### Concurrent data flow with USM
</div>
<div class="container">
<code class="code-100pc"><pre>
auto devicePtrA = sycl::malloc_device<int>(1024, gpuQueue);
auto devicePtrB = sycl::malloc_device<int>(1024, gpuQueue);
auto e1 = gpuQueue.memcpy(devicePtrA, dataA, sizeof(int) * 1024);
<mark>auto e2</mark> = gpuQueue.memcpy(devicePtrB, dataB, sizeof(int) * 1024);
auto e3 = gpuQueue.parallel_for<kernel_a>(sycl::range{1024}, e1, [=](sycl::id<1> idx) {
devicePtrA[idx] = /* some computation */ });
<mark>auto e4</mark> = gpuQueue.parallel_for<kernel_b>(sycl::range{1024}, <mark>e2</mark>, [=](sycl::id<1> idx) {
devicePtrB[idx] = /* some computation */ });
auto e5 = gpuQueue.memcpy(dataA, devicePtrA, sizeof(int) * 1024, e3);
<mark>auto e6</mark> = gpuQueue.memcpy(dataB, devicePtrB, sizeof(int) * 1024, <mark>e4</mark>);
e5.wait(); <mark>e6.wait();</mark>
sycl::free(devicePtrA, gpuQueue);
sycl::free(devicePtrB, gpuQueue);
</pre></code>
</div>
<div class="container" data-markdown>
* Each chain of commands is independent of the other
* The two kernels and their respective copies can execute in any order
</div>
</section>
<!--Slide 18-->
<section>
<div class="hbox" data-markdown>
#### Which should you choose?
</div>
<div class="container" data-markdown>
When should you use the buffer/accessor or USM data management models?
</div>
<div class="container">
<div class="col" data-markdown>
Buffer/accessor data movement model
<br/>
* If you want to guarantee consistency and avoid errors
* If you want to iterate on your data flow more quickly
</div>
<div class="col" data-markdown>
USM data movement model
<br/>
* If you need to use USM
* If you want more fine grained control over data movement
</div>
</div>
</section>
<!--Slide 19-->
<section>
<div class="hbox" data-markdown>
## Questions
</div>
</section>
<!--Slide 20-->
<section>
<div class="hbox" data-markdown>
#### Exercise
</div>
<div class="container" data-markdown>
Code_Exercises/Data_and_Dependencies/source
</div>
<div class="container" data-markdown>
![SYCL](../common-revealjs/images/diamond_data_flow.png "SYCL")
</div>
<div class="container" data-markdown>
Put together what you've seen here to create the above diamond data flow graph, using either the buffer/accessor or USM data management model
</div>
</section>
</div>
</div>
<script src="../common-revealjs/js/reveal.js"></script>
<script src="../common-revealjs/plugin/markdown/marked.js"></script>
<script src="../common-revealjs/plugin/markdown/markdown.js"></script>
<script src="../common-revealjs/plugin/notes/notes.js"></script>
<script>
Reveal.initialize({mouseWheel: true, defaultNotes: true});
Reveal.configure({ slideNumber: true });
</script>
</body>
</html>