Project 2: Dongying Liu #7

Open · wants to merge 16 commits into base: main
112 changes: 106 additions & 6 deletions README.md
@@ -3,12 +3,112 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Dongying Liu
* [LinkedIn](https://www.linkedin.com/in/dongying-liu/), [personal website](https://vivienliu1998.wixsite.com/portfolio)
* Tested on: Windows 11, i7-11700 @ 2.50GHz, NVIDIA GeForce RTX 3060

# Project Description
This project implements GPU stream compaction in CUDA, which will be reused in the later path tracer project.
Stream compaction generally consists of three steps (a short sketch follows the figure below):
1) Boolean mapping: map the input data to a boolean array, where an element is 1 if the corresponding input meets the criteria and 0 otherwise.
2) Exclusive scan: run an exclusive scan on the boolean array and save the result to a temporary array.
3) Scatter: for every element whose boolean value is 1, use the corresponding scan result as the output index and write the input element to the final result array.
<img src="/img/stream_compaction.jpg" width="300">

To implement GPU stream compaction, the following algorithms are implemented in this project:

### CPU scan (exclusive prefix sum)
Implemented for performance comparison and as a reference before moving to the GPU.
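For example, the exclusive scan of `[3 1 7 0 4]` is `[0 3 4 11 11]`: each output element is the sum of all inputs strictly before it.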
### CPU Stream Compaction without scan and CPU Stream Compaction with scan
Also implemented for performance comparison and as a reference before moving to the GPU.
### Naive GPU Scan (exclusive prefix sum)
As the example shows, because in each iteration one thread may read an array element that another thread writes, this naive algorithm needs two buffers: one read-only and one write-only, swapped (ping-ponged) after every iteration. A minimal kernel sketch follows the figure below.
<img src="/img/naive_gpu_scan.jpg" width="300">
### Work-Efficient GPU Scan
This algorithm is based on the one presented by Blelloch. As the example shows, it consists of two phases: up-sweep and down-sweep.
The up-sweep is also known as a parallel reduction, because after this phase the last element of the array holds the sum of all elements.
In the down-sweep phase, the partial sums from the up-sweep are used to build the scan of the array. Start by replacing the last element with zero; then, at each level, every node passes its own value to its left child, and its right child receives the sum of the node's value and the former left child's value. The host-side loop is sketched after the figure below.
<img src="/img/efficient_gpu_scan.jpg" width="300">

# Performance Analysis
For the performance analysis, I used a block size of 256 and tested four array sizes: 2^12, 2^16, 2^20, and 2^24. Lower elapsed time means better performance. The GPU timings do not include cudaMalloc and cudaMemcpy time; a sketch of the kind of event-based timing used is shown below.
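
For reference, a minimal sketch of cudaEvent-based timing, which is roughly what the project's PerformanceTimer wraps (illustrative only; the timer itself is not shown in this diff):
```
// Time only the kernel launches, excluding cudaMalloc/cudaMemcpy.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... kernel launches to be measured ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
```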

## Scan (exclusive prefix sum)
<img src="/img/scan_analysis_large.png" width="700"> <img src="/img/scan_analysis.gif" width="700">
* As the left graph shows, when the array size is between 2^8 and 2^16, the CPU actually runs faster than both the naive and work-efficient GPU versions. I think the reason is that for small inputs, the cost of reading data from GPU global memory is not offset by the time saved through parallel computation, so the GPU performs about the same as or worse than the CPU.
However, once the amount of data passes a threshold (2^20 in my tests), as the right graph shows, the work-efficient GPU algorithm starts to perform much better than both the CPU and the naive GPU algorithm.

* Regardless of array size, thrust is always the fastest, much faster even than the work-efficient GPU version. I think the reason is that in my work-efficient implementation many threads are idle in later iterations, yet they still have to wait for the active threads in the same warp to finish before they can be scheduled for other work.
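
For comparison, a minimal example of what the measured Thrust scan looks like (illustrative; the project's thrust wrapper is not part of this diff and presumably wraps a call like this):
```
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

void thrustScanExample(int n, int* odata, const int* idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);   // host -> device
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);    // device -> host
}
```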

* Of the two GPU versions, the naive one takes more time, sometimes as much as the CPU version. I think the most important reason is that we ping-pong between two buffers every iteration, and the cudaMemcpy involved in that swapping takes most of the time.

## Stream Compaction
<img src="/img/stream_compaction_large.png" width="700"> <img src="/img/stream_compaction_analysis.gif" width="700">
* As with scan, the left graph shows that when the array size is between 2^8 and 2^16, the two CPU functions run faster than the GPU, for the same reason as above.
Likewise, once the amount of data passes a threshold (2^20 in my tests), as the right graph shows, the work-efficient GPU algorithm starts to perform much better than the two CPU functions.
* What is interesting here is that the CPU version without scan performed much better than the CPU version with scan. Apparently, when not running in parallel, the extra passes and buffers required by the scan-based compaction are very time consuming; the scan-based approach only pays off when run in parallel.

## Output of the Test Program
```
****************
** SCAN TESTS **
****************
[ 17 0 9 24 3 46 31 40 8 44 48 9 24 ... 49 0 ]
==== cpu scan, power-of-two ====
elapsed time: 25.2266ms (std::chrono Measured)
[ 0 17 17 26 50 53 99 130 170 178 222 270 279 ... 410923873 410923922 ]
==== cpu scan, non-power-of-two ====
elapsed time: 24.2735ms (std::chrono Measured)
[ 0 17 17 26 50 53 99 130 170 178 222 270 279 ... 410923806 410923816 ]
passed
==== naive scan, power-of-two ====
elapsed time: 21.4419ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 21.3356ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 9.57216ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 9.46726ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.989184ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.25235ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 2 1 1 2 1 0 0 3 0 2 3 3 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 36.6462ms (std::chrono Measured)
[ 3 2 1 1 2 1 3 2 3 3 3 2 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 35.2236ms (std::chrono Measured)
[ 3 2 1 1 2 1 3 2 3 3 3 2 3 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 94.3538ms (std::chrono Measured)
[ 3 2 1 1 2 1 3 2 3 3 3 2 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 10.7924ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 11.4318ms (CUDA Measured)
passed
```

## Bloopers
This is more a question than a blooper.
At first I used pow() in the kernel and passed two ints to it. When the array size reached 2^11, the algorithm began to behave strangely, and it took me an hour to discover that pow() was the cause.
In device code, pow is declared as `double pow(double x, double y)`, and there is also powf(), declared as `float powf(float x, float y)`.
So I switched to powf(), but it misbehaved again once the array size reached 2^16, and I eventually replaced all the power calls with bitwise shifts.
Are there any guidelines we need to follow when using the power functions in a kernel?
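
For reference, the bit-shift replacement that fixed it, mirroring kernUpSweep/kernDownSweep in efficient.cu below (the commented-out powf lines show the variant that misbehaved):
```
// Integer offsets are now computed with shifts instead of floating-point
// power functions:
//   int offset1 = powf(2, depth);        // float-precision version that broke
//   int offset2 = powf(2, depth + 1);
int offset1 = 1 << depth;                 // exact integer 2^depth
int offset2 = 1 << (depth + 1);           // exact integer 2^(depth+1)
```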

Binary file added img/efficient_gpu_scan.jpg
Binary file added img/naive_gpu_scan.jpg
Binary file added img/scan_analysis.gif
Binary file added img/scan_analysis_large.png
Binary file added img/stream_compaction.jpg
Binary file added img/stream_compaction_analysis.gif
Binary file added img/stream_compaction_large.png
6 changes: 4 additions & 2 deletions src/main.cpp
@@ -13,7 +13,8 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"


const int SIZE = 1 << 24; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -31,6 +32,7 @@ int main(int argc, char* argv[]) {
a[SIZE - 1] = 0;
printArray(SIZE, a, true);


// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.
@@ -71,7 +73,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, false);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
22 changes: 20 additions & 2 deletions stream_compaction/common.cu
@@ -24,15 +24,33 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    if (idata[index]) {
        bools[index] = 1;
    }
    else {
        bools[index] = 0;
    }
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata, const int *idata,
const int *bools, const int *indices)
{
// TODO
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    if (bools[index] == 1) {
        odata[indices[index]] = idata[index];
    }
}

}
36 changes: 34 additions & 2 deletions stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO
    odata[0] = 0;
    for (int k = 1; k < n; ++k) {
        odata[k] = odata[k - 1] + idata[k - 1];
    }
    timer().endCpuTimer();
}

@@ -31,8 +35,16 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO
    int resultSize = 0;
    for (int i = 0; i < n; ++i) {
        int num = idata[i];
        if (num) {
            odata[resultSize] = num;
            resultSize++;
        }
    }
    timer().endCpuTimer();
    return resultSize;
}

/**
@@ -43,8 +55,28 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO
    // create bool array (reuse odata to hold the boolean flags)
    for (int i = 0; i < n; ++i) {
        odata[i] = idata[i] ? 1 : 0;
    }

    // exclusive scan of the boolean array
    int* scan = new int[n];
    scan[0] = 0;
    for (int i = 1; i < n; ++i) {
        scan[i] = scan[i - 1] + odata[i - 1];
    }

    // scatter: works in place because scan[i] <= i
    int resultSize = 0;
    for (int i = 0; i < n; ++i) {
        if (odata[i]) {
            odata[scan[i]] = idata[i];
            resultSize++;
        }
    }
    delete[] scan;
    timer().endCpuTimer();
    return resultSize;
}
}
}
105 changes: 103 additions & 2 deletions stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "efficient.h"

#define blockSize 256

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,67 @@ namespace StreamCompaction {
return timer;
}

// Up-sweep (parallel reduction): at depth d, every 2^(d+1)-th element adds in
// the partial sum sitting 2^d to its left.
__global__ void kernUpSweep(int n, int* data, int depth) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    //int offset1 = powf(2, depth);
    //int offset2 = powf(2, depth + 1);
    int offset1 = 1 << depth;
    int offset2 = 1 << (depth + 1);
    if (index % offset2 == 0) {
        data[index + offset2 - 1] += data[index + offset1 - 1];
    }
}

// Down-sweep: walk back down the tree, passing each node's value to its left
// child and giving the right child the sum, producing the exclusive scan.
__global__ void kernDownSweep(int n, int* data, int depth) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    int offset1 = 1 << depth;
    int offset2 = 1 << (depth + 1);
    if (index % offset2 == 0) {
        int t = data[index + offset1 - 1];
        data[index + offset1 - 1] = data[index + offset2 - 1];
        data[index + offset2 - 1] += t;
    }
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
    int power = ilog2ceil(n);
    int arraySize = pow(2, power);
    dim3 blockPerGrid((arraySize + blockSize - 1) / blockSize);
    dim3 threadPerBlock(blockSize);

    int* dev_data;

    // create memory
    cudaMalloc((void**)&dev_data, arraySize * sizeof(int));
    // zero the padded buffer, then copy in the original n elements
    cudaMemset(dev_data, 0, arraySize * sizeof(int));
    cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    timer().startGpuTimer();
    // TODO
    for (int i = 0; i < power; i++) {
        kernUpSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_data, i);
    }
    // set the root to 0
    cudaMemset(dev_data + arraySize - 1, 0, sizeof(int));
    for (int i = power - 1; i >= 0; i--) {
        kernDownSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_data, i);
    }
    timer().endGpuTimer();

    cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);

    // free memory
    cudaFree(dev_data);
}

/**
@@ -31,10 +87,55 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
    int power = ilog2ceil(n);
    int arraySize = pow(2, power);
    dim3 blockPerGrid((arraySize + blockSize - 1) / blockSize);
    dim3 threadPerBlock(blockSize);

    int* dev_idata;
    int* dev_odata;
    int* dev_boolBuffer;
    int* dev_scanResultBuffer;

    // malloc
    cudaMalloc((void**)&dev_idata, arraySize * sizeof(int));
    cudaMalloc((void**)&dev_odata, arraySize * sizeof(int));
    cudaMalloc((void**)&dev_boolBuffer, arraySize * sizeof(int));
    cudaMalloc((void**)&dev_scanResultBuffer, arraySize * sizeof(int));

    // zero the padded buffer, then copy in the original data
    // important for non-power-of-two data!
    // if the padding is not zeroed, the extra non-zero elements would map to 1 in the boolean array
    cudaMemset(dev_idata, 0, arraySize * sizeof(int));
    cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    timer().startGpuTimer();
    // TODO
    StreamCompaction::Common::kernMapToBoolean<<<blockPerGrid, threadPerBlock>>>(
        arraySize, dev_boolBuffer, dev_idata);
    cudaMemcpy(dev_scanResultBuffer, dev_boolBuffer, arraySize * sizeof(int), cudaMemcpyDeviceToDevice);

    for (int i = 0; i < power; ++i) {
        kernUpSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_scanResultBuffer, i);
    }
    // set the root to 0
    cudaMemset(dev_scanResultBuffer + arraySize - 1, 0, sizeof(int));
    for (int i = power - 1; i >= 0; --i) {
        kernDownSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_scanResultBuffer, i);
    }

    StreamCompaction::Common::kernScatter<<<blockPerGrid, threadPerBlock>>>(
        arraySize, dev_odata, dev_idata, dev_boolBuffer, dev_scanResultBuffer);
    timer().endGpuTimer();

    int* host_scanResultBuffer = new int[arraySize];
    cudaMemcpy(host_scanResultBuffer, dev_scanResultBuffer, arraySize * sizeof(int), cudaMemcpyDeviceToHost);

    // the last scan element gives the compacted count
    // (valid here because the padded/last input element is zero)
    int resultCount = host_scanResultBuffer[arraySize - 1];
    cudaMemcpy(odata, dev_odata, resultCount * sizeof(int), cudaMemcpyDeviceToHost);

    // free host and device memory
    delete[] host_scanResultBuffer;
    cudaFree(dev_idata);
    cudaFree(dev_odata);
    cudaFree(dev_boolBuffer);
    cudaFree(dev_scanResultBuffer);
    return resultCount;
}
}
}