Project 2: Dongying Liu #7

Open · wants to merge 16 commits into base: main
112 changes: 106 additions & 6 deletions README.md
@@ -3,12 +3,112 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Dongying Liu
* [LinkedIn](https://www.linkedin.com/in/dongying-liu/), [personal website](https://vivienliu1998.wixsite.com/portfolio)
* Tested on: Windows 11, i7-11700 @ 2.50GHz, NVIDIA GeForce RTX 3060

# Project Description
This project implements GPU stream compaction in CUDA, which will be reused in the later path tracer project.
Stream compaction generally consists of three steps (a short sketch follows the figure below):
1) Boolean mapping: map the input data to a boolean array, where an element is 1 if the corresponding input meets the criteria and 0 otherwise.
2) Exclusive scan: run an exclusive scan on the boolean array and save the result to a temporary array.
3) Scatter: for every element whose boolean value is 1, use the corresponding scan result as the output index and write the input element to the final result array.
<img src="/img/stream_compaction.jpg" width="300">

To implement GPU stream compaction, the following algorithms are implemented in this project:

### CPU scan (exclusive prefix sum)
Implemented for performance comparison and as a reference before moving to the GPU.
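For example, the exclusive scan of `[3 1 7 0 4]` is `[0 3 4 11 11]`: each output element is the sum of all inputs strictly before it.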
### CPU Stream Compaction without scan and CPU Stream Compaction with scan
Also implemented for performance comparison and as a reference before moving to the GPU.
### Naive GPU Scan (exclusive prefix sum)
As the example shows, because in each iteration one thread may read an array element that another thread writes, this naive algorithm needs two buffers: one read-only and one write-only, swapped (ping-ponged) after every iteration. A minimal kernel sketch follows the figure below.
<img src="/img/naive_gpu_scan.jpg" width="300">
### Work-Efficient GPU Scan
This algorithm is based on the one presented by Blelloch. As the example shows, it consists of two phases: up-sweep and down-sweep.
The up-sweep is also known as a parallel reduction, because after this phase the last element of the array holds the sum of all elements.
In the down-sweep phase, the partial sums from the up-sweep are used to build the scan of the array. Start by replacing the last element with zero; then, at each level, every node passes its own value to its left child, and its right child receives the sum of the node's value and the former left child's value. The host-side loop is sketched after the figure below.
<img src="/img/efficient_gpu_scan.jpg" width="300">

# Performance Analysis
For the performance analysis, I used a block size of 256 and tested four array sizes: 2^12, 2^16, 2^20, and 2^24. Lower elapsed time means better performance. The GPU timings do not include cudaMalloc and cudaMemcpy time; a sketch of the kind of event-based timing used is shown below.
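
For reference, a minimal sketch of cudaEvent-based timing, which is roughly what the project's PerformanceTimer wraps (illustrative only; the timer itself is not shown in this diff):
```
// Time only the kernel launches, excluding cudaMalloc/cudaMemcpy.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... kernel launches to be measured ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
```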

## Scan (exclusive prefix sum)
<img src="/img/scan_analysis_large.png" width="700"> <img src="/img/scan_analysis.gif" width="700">
* As the left graph shows, when the array size is between 2^8 and 2^16, the CPU actually runs faster than both the naive and work-efficient GPU versions. I think the reason is that for small inputs, the cost of reading data from GPU global memory is not offset by the time saved through parallel computation, so the GPU performs about the same as or worse than the CPU.
However, once the amount of data passes a threshold (2^20 in my tests), as the right graph shows, the work-efficient GPU algorithm starts to perform much better than both the CPU and the naive GPU algorithm.

* Regardless of array size, thrust is always the fastest, much faster even than the work-efficient GPU version. I think the reason is that in my work-efficient implementation many threads are idle in later iterations, yet they still have to wait for the active threads in the same warp to finish before they can be scheduled for other work.
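
For comparison, a minimal example of what the measured Thrust scan looks like (illustrative; the project's thrust wrapper is not part of this diff and presumably wraps a call like this):
```
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

void thrustScanExample(int n, int* odata, const int* idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);   // host -> device
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);    // device -> host
}
```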

* Of the two GPU versions, the naive one takes more time, sometimes as much as the CPU version. I think the most important reason is that we ping-pong between two buffers every iteration, and the cudaMemcpy involved in that swapping takes most of the time.

## Stream Compaction
<img src="/img/stream_compaction_large.png" width="700"> <img src="/img/stream_compaction_analysis.gif" width="700">
* As with scan, the left graph shows that when the array size is between 2^8 and 2^16, the two CPU functions run faster than the GPU, for the same reason as above.
Likewise, once the amount of data passes a threshold (2^20 in my tests), as the right graph shows, the work-efficient GPU algorithm starts to perform much better than the two CPU functions.
* What is interesting here is that the CPU version without scan performed much better than the CPU version with scan. Apparently, when not running in parallel, the extra passes and buffers required by the scan-based compaction are very time consuming; the scan-based approach only pays off when run in parallel.

## Output of the Test Program
```
****************
** SCAN TESTS **
****************
[ 17 0 9 24 3 46 31 40 8 44 48 9 24 ... 49 0 ]
==== cpu scan, power-of-two ====
elapsed time: 25.2266ms (std::chrono Measured)
[ 0 17 17 26 50 53 99 130 170 178 222 270 279 ... 410923873 410923922 ]
==== cpu scan, non-power-of-two ====
elapsed time: 24.2735ms (std::chrono Measured)
[ 0 17 17 26 50 53 99 130 170 178 222 270 279 ... 410923806 410923816 ]
passed
==== naive scan, power-of-two ====
elapsed time: 21.4419ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 21.3356ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 9.57216ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 9.46726ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.989184ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.25235ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 2 1 1 2 1 0 0 3 0 2 3 3 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 36.6462ms (std::chrono Measured)
[ 3 2 1 1 2 1 3 2 3 3 3 2 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 35.2236ms (std::chrono Measured)
[ 3 2 1 1 2 1 3 2 3 3 3 2 3 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 94.3538ms (std::chrono Measured)
[ 3 2 1 1 2 1 3 2 3 3 3 2 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 10.7924ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 11.4318ms (CUDA Measured)
passed
```

## Bloopers
This is more a question than a blooper.
At first I used pow() in the kernel and passed two ints to it. When the array size reached 2^11, the algorithm began to behave strangely, and it took me an hour to discover that pow() was the cause.
In device code, pow is declared as `double pow(double x, double y)`, and there is also powf(), declared as `float powf(float x, float y)`.
So I switched to powf(), but it misbehaved again once the array size reached 2^16, and I eventually replaced all the power calls with bitwise shifts.
Are there any guidelines we need to follow when using the power functions in a kernel?
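
For reference, the bit-shift replacement that fixed it, mirroring kernUpSweep/kernDownSweep in efficient.cu below (the commented-out powf lines show the variant that misbehaved):
```
// Integer offsets are now computed with shifts instead of floating-point
// power functions:
//   int offset1 = powf(2, depth);        // float-precision version that broke
//   int offset2 = powf(2, depth + 1);
int offset1 = 1 << depth;                 // exact integer 2^depth
int offset2 = 1 << (depth + 1);           // exact integer 2^(depth+1)
```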

Binary file added img/efficient_gpu_scan.jpg
Binary file added img/naive_gpu_scan.jpg
Binary file added img/scan_analysis.gif
Binary file added img/scan_analysis_large.png
Binary file added img/stream_compaction.jpg
Binary file added img/stream_compaction_analysis.gif
Binary file added img/stream_compaction_large.png
6 changes: 4 additions & 2 deletions src/main.cpp
@@ -13,7 +13,8 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"


const int SIZE = 1 << 24; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -31,6 +32,7 @@ int main(int argc, char* argv[]) {
a[SIZE - 1] = 0;
printArray(SIZE, a, true);


// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.
@@ -71,7 +73,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, false);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
22 changes: 20 additions & 2 deletions stream_compaction/common.cu
@@ -24,15 +24,33 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    if (idata[index]) {
        bools[index] = 1;
    }
    else {
        bools[index] = 0;
    }
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata, const int *idata,
const int *bools, const int *indices)
{
// TODO
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    if (bools[index] == 1) {
        odata[indices[index]] = idata[index];
    }
}

}
36 changes: 34 additions & 2 deletions stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO
    odata[0] = 0;
    for (int k = 1; k < n; ++k) {
        odata[k] = odata[k - 1] + idata[k - 1];
    }
    timer().endCpuTimer();
}

@@ -31,8 +35,16 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO
    int resultSize = 0;
    for (int i = 0; i < n; ++i) {
        int num = idata[i];
        if (num) {
            odata[resultSize] = num;
            resultSize++;
        }
    }
    timer().endCpuTimer();
    return resultSize;
}

/**
@@ -43,8 +55,28 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    // TODO
    // create bool array (reuse odata to hold the boolean flags)
    for (int i = 0; i < n; ++i) {
        odata[i] = idata[i] ? 1 : 0;
    }

    // exclusive scan of the boolean array
    int* scan = new int[n];
    scan[0] = 0;
    for (int i = 1; i < n; ++i) {
        scan[i] = scan[i - 1] + odata[i - 1];
    }

    // scatter: works in place because scan[i] <= i
    int resultSize = 0;
    for (int i = 0; i < n; ++i) {
        if (odata[i]) {
            odata[scan[i]] = idata[i];
            resultSize++;
        }
    }
    delete[] scan;
    timer().endCpuTimer();
    return resultSize;
}
}
}
105 changes: 103 additions & 2 deletions stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "efficient.h"

#define blockSize 256

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,67 @@ namespace StreamCompaction {
return timer;
}

// Up-sweep (parallel reduction): at depth d, every 2^(d+1)-th element adds in
// the partial sum sitting 2^d to its left.
__global__ void kernUpSweep(int n, int* data, int depth) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    //int offset1 = powf(2, depth);
    //int offset2 = powf(2, depth + 1);
    int offset1 = 1 << depth;
    int offset2 = 1 << (depth + 1);
    if (index % offset2 == 0) {
        data[index + offset2 - 1] += data[index + offset1 - 1];
    }
}

// Down-sweep: walk back down the tree, passing each node's value to its left
// child and giving the right child the sum, producing the exclusive scan.
__global__ void kernDownSweep(int n, int* data, int depth) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) {
        return;
    }
    int offset1 = 1 << depth;
    int offset2 = 1 << (depth + 1);
    if (index % offset2 == 0) {
        int t = data[index + offset1 - 1];
        data[index + offset1 - 1] = data[index + offset2 - 1];
        data[index + offset2 - 1] += t;
    }
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
    int power = ilog2ceil(n);
    int arraySize = pow(2, power);
    dim3 blockPerGrid((arraySize + blockSize - 1) / blockSize);
    dim3 threadPerBlock(blockSize);

    int* dev_data;

    // create memory
    cudaMalloc((void**)&dev_data, arraySize * sizeof(int));
    // zero the padded buffer, then copy in the original n elements
    cudaMemset(dev_data, 0, arraySize * sizeof(int));
    cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    timer().startGpuTimer();
    // TODO
    for (int i = 0; i < power; i++) {
        kernUpSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_data, i);
    }
    // set the root to 0
    cudaMemset(dev_data + arraySize - 1, 0, sizeof(int));
    for (int i = power - 1; i >= 0; i--) {
        kernDownSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_data, i);
    }
    timer().endGpuTimer();

    cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);

    // free memory
    cudaFree(dev_data);
}

/**
@@ -31,10 +87,55 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
    int power = ilog2ceil(n);
    int arraySize = pow(2, power);
    dim3 blockPerGrid((arraySize + blockSize - 1) / blockSize);
    dim3 threadPerBlock(blockSize);

    int* dev_idata;
    int* dev_odata;
    int* dev_boolBuffer;
    int* dev_scanResultBuffer;

    // malloc
    cudaMalloc((void**)&dev_idata, arraySize * sizeof(int));
    cudaMalloc((void**)&dev_odata, arraySize * sizeof(int));
    cudaMalloc((void**)&dev_boolBuffer, arraySize * sizeof(int));
    cudaMalloc((void**)&dev_scanResultBuffer, arraySize * sizeof(int));

    // zero the padded buffer, then copy in the original data
    // important for non-power-of-two data!
    // if the padding is not zeroed, the extra non-zero elements would map to 1 in the boolean array
    cudaMemset(dev_idata, 0, arraySize * sizeof(int));
    cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    timer().startGpuTimer();
    // TODO
    StreamCompaction::Common::kernMapToBoolean<<<blockPerGrid, threadPerBlock>>>(
        arraySize, dev_boolBuffer, dev_idata);
    cudaMemcpy(dev_scanResultBuffer, dev_boolBuffer, arraySize * sizeof(int), cudaMemcpyDeviceToDevice);

    for (int i = 0; i < power; ++i) {
        kernUpSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_scanResultBuffer, i);
    }
    // set the root to 0
    cudaMemset(dev_scanResultBuffer + arraySize - 1, 0, sizeof(int));
    for (int i = power - 1; i >= 0; --i) {
        kernDownSweep<<<blockPerGrid, threadPerBlock>>>(arraySize, dev_scanResultBuffer, i);
    }

    StreamCompaction::Common::kernScatter<<<blockPerGrid, threadPerBlock>>>(
        arraySize, dev_odata, dev_idata, dev_boolBuffer, dev_scanResultBuffer);
    timer().endGpuTimer();

    int* host_scanResultBuffer = new int[arraySize];
    cudaMemcpy(host_scanResultBuffer, dev_scanResultBuffer, arraySize * sizeof(int), cudaMemcpyDeviceToHost);

    // the last scan element gives the compacted count
    // (valid here because the padded/last input element is zero)
    int resultCount = host_scanResultBuffer[arraySize - 1];
    cudaMemcpy(odata, dev_odata, resultCount * sizeof(int), cudaMemcpyDeviceToHost);

    // free host and device memory
    delete[] host_scanResultBuffer;
    cudaFree(dev_idata);
    cudaFree(dev_odata);
    cudaFree(dev_boolBuffer);
    cudaFree(dev_scanResultBuffer);
    return resultCount;
}
}
}