DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel.
Intel has ceased development of and contributions to this project, including, but not limited to, maintenance, bug fixes, new releases, and updates.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Intel Big Data Analytic Toolkit

Intel Big Data Analytic Toolkit (abbrev. BDTK) is a set of acceleration libraries aimed at optimizing big data analytic frameworks.

By using this library, the query performance of frontend SQL engines such as Prestodb and Spark can be significantly improved.

Problem Statement

Users of big data analytic frameworks have an increasingly pressing need for better performance. Most existing big data analytic frameworks are built in Java and designed primarily for CPU-only computation. To unlock performance closer to bare-metal hardware, this toolkit employs native implementations and leverages state-of-the-art hardware.

Furthermore, assembling solutions from building blocks has become a trend among data analytic solution providers: over the last five years, more and more SQL-based solutions have been built on top of primitive building blocks. Having performant out-of-the-box building blocks (as libraries) spares developers from building everything from scratch, so a general-purpose toolkit like this can significantly reduce time-to-value for analytic solution developers.

Targeted Use Cases

BDTK focuses on the following areas:

  • End-users of big data analytic frameworks who are looking for performance acceleration
  • Data engineers who want Intel architecture-based optimizations
  • Database developers who are seeking reusable building blocks
  • Data scientists who are looking for heterogeneous execution

Users can reuse the implemented operators/functions to build a full-featured SQL engine. Currently, this library offers a highly optimized compiler that turns queries into JITed functions for execution.

Building blocks utilizing compression codecs (based on IAA and QAT) can be used directly in Hadoop/Spark for compression acceleration.

The figure below shows the personas this project serves.

[Figure: BDTK personas]

Introduction

The following diagram shows the design architecture. Currently, BDTK offers a few building blocks, including a lightweight LLVM-based SQL compiler (Cider) on top of the Arrow data format; ICL, a compression codec leveraging the Intel IAA accelerator; and QATCodec, a compression codec wrapper based on the Intel QAT accelerator.

[Figure: BDTK architecture overview]

Solutions Introduction

  • Presto E2E Solution:

    BDTK can provide a Presto end-to-end acceleration solution via Velox and the Velox plugin.

    • Velox Plugin:

      The Velox plugin is a bridge that enables Big Data Analytic Toolkit on Velox. It introduces a hybrid execution mode combining compilation with the vectorization that already exists in Velox. It works as a plugin to Velox, integrating seamlessly without changing Velox code.

    • Cider:

      A modularized, general-purpose Just-In-Time (JIT) compiler for data analytic query engines. It employs Substrait as a protocol, allowing it to support multiple frontend engines. Currently it provides an LLVM-based implementation derived from HeavyDB. A sketch of the compile-once, run-per-batch flow it enables follows this list.

  • Analytic Cache Solution

    Analytic Cache aims to improve data-source-side performance for multiple big data analytic frameworks such as Apache Spark and Apache Flink. Compared to row-based execution engines, Analytic Cache utilizes a columnar format and performs batch computation, which boosts performance for ad-hoc queries. Analytic Cache also provides QAT codec acceleration and IAA predicate pushdown.
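
As mentioned in the Cider description above, the core execution pattern is compile once, run per batch. The minimal sketch below illustrates that flow. Every type and function name here is a hypothetical stand-in, not the actual Cider API: a Substrait plan arrives from a frontend engine, is JIT-compiled once, and the compiled kernel is then reused across Arrow data batches.

// Hypothetical sketch only: these types stand in for the real Cider API.
#include <memory>
#include <string>

struct SubstraitPlan { std::string bytes; };  // stand-in for the serialized protobuf
struct ArrowBatch {};                         // stand-in for an Arrow record batch

class CompiledKernel {
 public:
  // Runs the JIT-compiled code over one input batch.
  ArrowBatch run(const ArrowBatch& input) { return input; }
};

class JitCompiler {
 public:
  // Lowers the Substrait plan to LLVM IR and compiles it to native code.
  std::unique_ptr<CompiledKernel> compile(const SubstraitPlan&) {
    return std::make_unique<CompiledKernel>();
  }
};

int main() {
  SubstraitPlan plan;                        // produced by a frontend engine (e.g. Presto)
  JitCompiler compiler;
  auto kernel = compiler.compile(plan);      // compilation cost is paid once
  for (int i = 0; i < 3; ++i) {
    ArrowBatch batch;                        // batches arrive from the data source
    ArrowBatch result = kernel->run(batch);  // cost is amortized across many batches
    (void)result;
  }
  return 0;
}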

Reusable Modules

BDTK provides several functional modules for users to adopt or integrate into their products. Here is a brief description of each module; details can be found on the Module Page.

The Intel Codec Library module provides a compression and decompression library for Apache Hadoop/Spark that makes use of acceleration hardware for compression/decompression. It can leverage QAT/IAA hardware to accelerate deflate-compatible data compression algorithms, and it also supports Intel software-optimized solutions such as Intel ISA-L (Intel Intelligent Storage Acceleration Library) and IPP (Intel Integrated Performance Primitives Library) to accelerate data compression.
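
For orientation, here is a minimal software-only sketch of the deflate-compatible round trip that ICL accelerates, using plain zlib as a stand-in for the QAT/IAA-backed codec (compile with -lz). In a real deployment, ICL is configured as a Hadoop/Spark compression codec rather than being called directly like this.

// Software-only stand-in for the deflate-compatible path: zlib plays the
// role that ICL fills with QAT/IAA hardware or ISA-L/IPP software paths.
#include <zlib.h>
#include <cassert>
#include <cstring>
#include <vector>

int main() {
  const char* text = "hello hello hello hello";  // repetitive input compresses well
  uLong srcLen = static_cast<uLong>(std::strlen(text));

  // Compress: compressBound() gives a safe upper bound on the output size.
  std::vector<Bytef> compressed(compressBound(srcLen));
  uLongf compLen = static_cast<uLongf>(compressed.size());
  int rc = compress2(compressed.data(), &compLen,
                     reinterpret_cast<const Bytef*>(text), srcLen, Z_BEST_SPEED);
  assert(rc == Z_OK);

  // Decompress and verify the round trip.
  std::vector<Bytef> restored(srcLen);
  uLongf dstLen = srcLen;
  rc = uncompress(restored.data(), &dstLen, compressed.data(), compLen);
  assert(rc == Z_OK && dstLen == srcLen);
  assert(std::memcmp(restored.data(), text, srcLen) == 0);
  return 0;
}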

Hash table performance is critical to a SQL engine: operators like hash join and hash aggregation depend on an efficient hash table implementation.

The hash table module will provide a set of hash table implementations that are easy to use, leverage state-of-the-art hardware technology like AVX-512, and are optimized for query-specific scenarios.
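
To illustrate the probe loop such hash tables optimize, here is a deliberately simplified open-addressing (linear probing) table for a GROUP BY count. This is an illustrative sketch, not BDTK's implementation; a production version would add resizing and vectorized (e.g. AVX-512) probing.

// Minimal open-addressing hash table for a GROUP BY count. The probe loop
// below is the hot path that hardware-aware implementations speed up.
#include <cstdint>
#include <vector>

struct Slot { int64_t key; int64_t count; bool used; };

class CountTable {
 public:
  explicit CountTable(std::size_t capacity) : slots_(capacity) {}  // capacity: power of two

  void add(int64_t key) {
    std::size_t i = hash(key) & (slots_.size() - 1);
    while (slots_[i].used && slots_[i].key != key)
      i = (i + 1) & (slots_.size() - 1);  // linear probe to the next slot
    if (!slots_[i].used) slots_[i] = {key, 0, true};
    slots_[i].count++;
  }

 private:
  static std::size_t hash(int64_t k) {
    return static_cast<std::size_t>(k) * 0x9E3779B97F4A7C15ull;  // multiplicative hash
  }
  std::vector<Slot> slots_;
};

int main() {
  CountTable table(1024);  // no resizing in this sketch, so keep the load low
  for (int64_t k : {1, 2, 1, 3, 1}) table.add(k);
  return 0;
}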

The JIT Lib module provides unified JIT interfaces (Value, Ptr, control flow, etc.) to isolate operator logic from IR generation.
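
The sketch below is a guess at the shape of such an interface, not the actual JitLib API: operator logic is written once against abstract Value handles and structured control flow, while the builder hides basic blocks and IR emission behind it.

// Hypothetical sketch of a unified JIT interface. In a real JIT these calls
// would emit IR; here the bodies are stubs so the shape of the API shows.
#include <functional>

struct JitValue {  // handle to a value in the generated code
  JitValue operator+(JitValue) const { return {}; }
  JitValue operator<(JitValue) const { return {}; }
};

struct JitBuilder {
  JitValue constant(long v) { (void)v; return {}; }
  // Structured control flow: the builder would emit branch/merge IR, so the
  // operator author never touches basic blocks directly.
  void ifThen(JitValue cond, const std::function<void()>& body) {
    (void)cond;
    body();
  }
};

// Operator logic expressed once against the abstract interface; IR
// generation stays isolated behind JitBuilder.
JitValue addOneIfPositive(JitBuilder& b, JitValue x) {
  JitValue result = x;
  b.ifThen(b.constant(0) < x, [&] { result = x + b.constant(1); });
  return result;
}

int main() {
  JitBuilder b;
  addOneIfPositive(b, JitValue{});
  return 0;
}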

The expression evaluation module performs projection/filter computation efficiently. It provides a runtime expression evaluation API that accepts a Substrait-based expression representation and Apache Arrow-based columnar data. It currently handles only projections and filters.
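
As a concrete (non-JIT) illustration of what a columnar projection plus filter computes, the sketch below hard-codes the expression x + 1 WHERE x > 10 over an Arrow-style column of values plus validity flags. The real module compiles such an expression from its Substrait representation instead of hard-coding it.

// Evaluate "x + 1 WHERE x > 10" over an Arrow-style int64 column.
#include <cstdint>
#include <vector>

struct Int64Column {
  std::vector<int64_t> values;
  std::vector<bool> valid;  // Arrow uses a packed validity bitmap; bools for clarity
};

Int64Column projectPlusOneWhereGreaterTen(const Int64Column& in) {
  Int64Column out;
  for (std::size_t i = 0; i < in.values.size(); ++i) {
    if (in.valid[i] && in.values[i] > 10) {    // filter: x > 10 (nulls dropped)
      out.values.push_back(in.values[i] + 1);  // projection: x + 1
      out.valid.push_back(true);
    }
  }
  return out;
}

int main() {
  Int64Column col{{5, 12, 20}, {true, true, false}};
  Int64Column result = projectPlusOneWhereGreaterTen(col);  // yields {13}
  (void)result;
  return 0;
}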

BDTK implements typical SQL operators based on JitLib, providing a batch-at-a-time execution model. Each operator supports plug-and-play use and can easily be integrated into other existing SQL engines. Operators that BDTK targets to support include HashAggregation, HashJoin (HashBuild and HashProbe), etc.
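
Below is a hypothetical sketch of what a plug-and-play, batch-at-a-time operator interface can look like; the names are illustrative rather than BDTK's actual API. A host engine pushes one batch at a time into each operator and pulls result batches out, which is what makes integration into an existing engine's pipeline straightforward.

// Hypothetical batch-at-a-time operator interface.
#include <cstdint>
#include <memory>
#include <vector>

struct Batch { std::vector<int64_t> column0; };  // stand-in for an Arrow batch

class Operator {
 public:
  virtual ~Operator() = default;
  virtual void addInput(const Batch& batch) = 0;  // push one batch in
  virtual Batch getOutput() = 0;                  // pull results out
};

// A trivial pass-through stands in for HashAggregation / HashProbe etc.
class PassThrough : public Operator {
 public:
  void addInput(const Batch& batch) override { pending_ = batch; }
  Batch getOutput() override { return pending_; }
 private:
  Batch pending_;
};

int main() {
  std::unique_ptr<Operator> op = std::make_unique<PassThrough>();
  op->addInput(Batch{{1, 2, 3}});       // host engine drives batches through
  Batch out = op->getOutput();          // the pipeline, one at a time
  (void)out;
  return 0;
}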

Supported Features

Currently supported features are listed on the Project Page. Features newly supported in release 0.9 are listed on the release page.

Getting Started

Get the BDTK Source

git clone --recursive https://github.com/intel/BDTK.git
cd BDTK
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive

Setting up the BDTK development environment in a Linux Docker container

We provide a Dockerfile to help developers set up and install the BDTK dependencies.

  1. Build an image from the Dockerfile
$ cd ${path_to_source_of_bdtk}/ci/docker
$ docker build -t ${image_name} .
  2. Start a Docker container for development
$ docker run -d --name ${container_name} --privileged=true -v ${path_to_source_of_bdtk}:/workspace/bdtk ${image_name} /usr/sbin/init

How To Build

Once you have set up the Docker build environment for BDTK and obtained the source, you can enter the BDTK container and build as follows:

Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version, or make release to build an optimized version. Use make test-debug or make test-release to run tests.

How to Enable in Presto

To use it with Prestodb, the Intel version of Prestodb is required, together with the Intel version of Velox. Detailed steps are available in the installation guide.

Roadmap

In the upcoming release, the following work items are prioritized:

  • Better test coverage for the entire library
  • Better robustness and more implemented features enabled in Prestodb as the pilot SQL engine, by improving the offloading framework
  • Better extensibility at multiple levels (incl. relational algebra operators, expression functions, data formats), by adopting state-of-the-art multi-level compiler design
  • Complete Arrow format migration
  • Next-gen codegen framework
  • Support for large-volume data processing
  • Advanced features development

Code Of Conduct

Big Data Analytic Toolkit's Code of Conduct can be found here.

Online Documentation

You can find all the Big Data Analytic Toolkit documents on the project web page.

License

Big Data Analytic Toolkit is licensed under the Apache 2.0 License. A copy of the license can be found here.