
Parallel Aggregation Primitive on GPU


Welcome to the Parallel Aggregation Primitive on GPU wiki!

This is the project site for our final project in CMU's Parallel Computer Architecture and Programming course.

SUMMARY

We will implement the aggregation primitive in both SYCL and CUDA, run both versions on an NVIDIA RTX 2080 GPU, and perform a detailed analysis of both versions’ performance and optimization characteristics. The SYCL code will be compiled with Intel oneAPI targeting the NVIDIA GPU backend.

BACKGROUND

Database queries, such as those expressed in a DML (Data Manipulation Language) like SQL, are executed by database operator primitives. Aggregation is one of these primitives, and it is used frequently: 14 out of the 22 queries in TPC-H involve aggregation [1][2].

In simple terms, an aggregation performs the following: Aggregate(Input table) => Result Table

[Figure: example aggregation, in which an input table of (key, value) rows is reduced to a result table with one summed value per key]

where a summation of the values is performed for each key. In a database setting, these tables can have millions of rows, so there is an opportunity to parallelize this operation and vastly improve performance. A basic CPU-based multi-threaded algorithm is described by pseudocode in [1]; a sketch of that scheme is given below.
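
The pseudocode figure from [1] is not reproduced here; the following is a minimal C++ sketch of that style of algorithm under our own assumptions (integer keys, a SUM aggregate, and illustrative names such as `Row` and `aggregate` that are not from the paper). Each thread builds a private partial table over its chunk of the input, and the partial tables are merged at the end.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

struct Row { int64_t key; int64_t value; };  // one input tuple

// Multi-threaded SUM aggregation: each thread aggregates its chunk into a
// private partial table (no sharing, no locks), and the partial tables are
// merged at the end. Assumes num_threads >= 1.
std::unordered_map<int64_t, int64_t> aggregate(const std::vector<Row>& input,
                                               unsigned num_threads) {
    std::vector<std::unordered_map<int64_t, int64_t>> partials(num_threads);
    std::vector<std::thread> workers;
    const size_t chunk = (input.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const size_t begin = t * chunk;
            const size_t end   = std::min(input.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i)          // local aggregation
                partials[t][input[i].key] += input[i].value;
        });
    }
    for (auto& w : workers) w.join();

    std::unordered_map<int64_t, int64_t> result;          // merge step
    for (const auto& p : partials)
        for (const auto& [k, v] : p) result[k] += v;
    return result;
}
```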

THE CHALLENGE

  • The workload: The workload is essentially a parallel-sum problem with a “hashing” aspect overlaid on it. For each key, we first perform a local aggregation and then a parallel merge to compute that key's final value. The arithmetic intensity is low, since we are just streaming through the data set and performing accumulation; the only reuse happens at the “hash table” lookup, so we hope to optimize the hashtable lookup on the GPU. For the non-reuse memory access pattern, i.e., when we stream the data, we may need to experiment with prefetch instructions in CUDA and SYCL. Additionally, how we orchestrate the local aggregation and the parallel merge can have an impact on performance (see the sketch after this list).

  • The constraints: Because the GPU is a throughput machine with a distinct memory hierarchy, there will be data movement between SM (streaming multiprocessor) units during the parallel-merge step. Minimizing this communication, and expressing it in each specific programming abstraction, will be a challenge.
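
To make the two-phase structure concrete, here is a minimal CUDA sketch, hypothetical rather than our final kernel, of a naive array-scratchpad scheme: each block accumulates into a block-private table in shared memory, then merges it into the global result with atomics, which is exactly where the cross-SM communication occurs. It assumes keys fall in a small dense range [0, NUM_KEYS).

```cuda
#define NUM_KEYS 1024  // assumed small, dense key domain [0, NUM_KEYS)

// Phase 1: each block aggregates its slice of the input into a
// block-private scratchpad in shared memory (cheap, on-SM traffic).
// Phase 2: the scratchpad is merged into the global result table with
// atomics; this is where the cross-SM data movement happens.
__global__ void aggregate_naive(const int* keys, const unsigned long long* vals,
                                int n, unsigned long long* global_sums) {
    __shared__ unsigned long long local[NUM_KEYS];
    for (int k = threadIdx.x; k < NUM_KEYS; k += blockDim.x)
        local[k] = 0;                                // clear block-local table
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[keys[i]], vals[i]);         // local aggregation
    __syncthreads();

    for (int k = threadIdx.x; k < NUM_KEYS; k += blockDim.x)
        atomicAdd(&global_sums[k], local[k]);        // parallel merge
}
```

The merge-phase atomics contend across all blocks, which is why how we orchestrate local aggregation versus merging matters for performance.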

RESOURCES

Reference Paper:

[1] Z. F. Eryilmaz, A. Kakaraparthy, J. M. Patel, R. Sen and K. Park, "FPGA for Aggregate Processing: The Good, The Bad, and The Ugly," 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 2021, pp. 1044-1055, doi: 10.1109/ICDE51399.2021.00095.

[2] "TPC-H", [online] Available: http://www.tpc.org/tpch/.

[3] Karnagel, Tomas, René Müller, and Guy M. Lohman. "Optimizing GPU-accelerated Group-By and Aggregation." ADMS@ VLDB 8 (2015): 20.

[4] Tomé, Diego G., et al. "Optimizing Group-By and Aggregation using GPU-CPU Co-Processing." ADMS@ VLDB. 2018.

GOALS AND DELIVERABLES

  • Deliverables: Two versions of our implementation, one in SYCL and one in CUDA.
  • Poster Session Deliverables: Plots for performance analysis across tunable parameters.
  • Goals (at proposal time):
    • (Plan) Implement naive parallel aggregation in both languages, with no optimization of the local “scratchpad” in each partition.
    • (Plan) Implement the hash-table design described in [3][4].
    • (Plan) Performance comparison between:
      • SYCL naive vs SYCL hashtable (Can SYCL achieve better performance through development?)
      • SYCL naive vs CUDA naive (How do SYCL and CUDA compare out of the box?)
      • CUDA naive vs CUDA hashtable (Can CUDA achieve better performance through development? Is the improvement bigger in CUDA or in SYCL?)
      • SYCL hashtable vs CUDA hashtable (How do SYCL and CUDA limit the optimization ceiling?)
    • (Reach) Implement scratchpad optimizations from our own ideas and analyze their performance.
  • Goals (revised at milestone): We will produce an optimization study in both programming languages, documenting how each optimization step influences performance in each language. This way, we gain insight into how each optimization improves performance within a language, and we can evaluate how easy it is to implement each optimization in each language. Planned optimization strategies (see the sketch after this list):
    • From an array-based scratchpad to a hashtable-based one
    • Placing the hashtables into the GPU's shared memory
    • Reordering inputs to minimize memory contention
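
As a minimal sketch of the hashtable-based scratchpad direction, here is an illustration of our own, not the exact scheme of [3][4]: it assumes 32-bit keys, linear probing, and a power-of-two table small enough for shared memory.

```cuda
#define TABLE_SIZE 1024            // assumed power of two, fits in shared memory
#define EMPTY_KEY  0xFFFFFFFFu     // sentinel; assumes no real key takes this value

// Insert-or-accumulate into a shared-memory hash table with linear probing.
// A slot is claimed with atomicCAS on the key; the running sum is then
// updated with atomicAdd. table_keys must be initialized to EMPTY_KEY and
// table_sums to 0 before use.
__device__ void hash_accumulate(unsigned int* table_keys,
                                unsigned long long* table_sums,
                                unsigned int key, unsigned long long val) {
    unsigned int slot = (key * 2654435761u) & (TABLE_SIZE - 1);  // multiplicative hash
    while (true) {
        unsigned int prev = atomicCAS(&table_keys[slot], EMPTY_KEY, key);
        if (prev == EMPTY_KEY || prev == key) {  // claimed empty slot, or key already here
            atomicAdd(&table_sums[slot], val);
            return;
        }
        slot = (slot + 1) & (TABLE_SIZE - 1);    // collision: probe next slot
    }
}
```

If the table fills up, this probe loop never terminates, so a real implementation needs overflow handling, for example spilling entries to a larger table in global memory.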

PLATFORM CHOICE

Our current plan is to run on the GHC machines, which use an NVIDIA RTX 2080 GPU, since they are available to us. SYCL is a heterogeneous programming language intended to be portable across compute platforms, from CPUs and GPUs to FPGAs. It is more flexible than CUDA, but can it achieve the same performance as a platform-specific language?

We chose CUDA as the other programming language because it is designed by NVIDIA and optimized for their GPUs, so we are curious how SYCL will perform compared to a programming platform that is specifically designed for one architecture.

SCHEDULE

WEEK 1 Nov 15 - Nov 19: Set up the software environments for SYCL and CUDA, start implementing naive versions

WEEK 2 Nov 20 - Nov 26: Finish implementation and start adding performance collection

WEEK 3 Nov 27 - Dec 03: Start hashtable optimizations, start performance analysis

WEEK 4 Dec 04 - Dec 10: Optimization

  Xiying: ( cuco Implementation | SimpleHash Optimization )

  Simon: Experiment with introducing partial hashtable in GPU shared memory to improve memory limitation

WEEK 5 Dec 11 - Dec 14:

  First half: More comprehensive benchmarking with more fine-grained parameter sweeping (All)

  First half: Handle overflowed tasks (All)

  Second half: Making the final report/poster (All)

PROGRESS

DONE, goal achieved.

WORK SPLIT

50/50 work split, with one person on SYCL and the other on CUDA. The analysis was carried out together.

RESULTS

---* Link to the Report *---