
[NSE-699] Update Changelog and documents for OAP 1.4.0 (#1003)
* [NSE-699] Update Changelog and documents for OAP 1.4.0

* Update documentation changes

* Update required GCC version

* Update Arrow branch
Hong authored Jul 1, 2022
1 parent e9cc866 commit 80c72aa
Showing 14 changed files with 1,090 additions and 70 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tpch.yml
@@ -51,7 +51,7 @@ jobs:
run: |
cd /tmp
git clone https://github.com/oap-project/arrow.git
-cd arrow && git checkout arrow-4.0.0-oap && cd cpp
+cd arrow && git checkout arrow-4.0.0-oap-1.4 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_ORC=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF && make -j2
sudo make install
6 changes: 3 additions & 3 deletions .github/workflows/unittests.yml
@@ -54,7 +54,7 @@ jobs:
run: |
cd /tmp
git clone https://github.com/oap-project/arrow.git
-cd arrow && git checkout arrow-4.0.0-oap && cd cpp
+cd arrow && git checkout arrow-4.0.0-oap-1.4 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_ORC=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
sudo make install
@@ -97,7 +97,7 @@ jobs:
run: |
cd /tmp
git clone https://github.com/oap-project/arrow.git
-cd arrow && git checkout arrow-4.0.0-oap && cd cpp
+cd arrow && git checkout arrow-4.0.0-oap-1.4 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_ORC=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
sudo make install
@@ -142,7 +142,7 @@ jobs:
run: |
cd /tmp
git clone https://github.com/oap-project/arrow.git
-cd arrow && git checkout arrow-4.0.0-oap && cd cpp
+cd arrow && git checkout arrow-4.0.0-oap-1.4 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_ORC=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
sudo make install
175 changes: 148 additions & 27 deletions CHANGELOG.md


856 changes: 855 additions & 1 deletion TPP.txt


4 changes: 2 additions & 2 deletions arrow-data-source/README.md
@@ -13,7 +13,7 @@ The development of this library is still in progress. As a result some of the fu
There are some requirements before you build the project.
Please make sure you have already installed the software in your system.

-1. GCC 7.0 or higher version
+1. GCC 9.0 or higher version
2. java8 OpenJDK -> yum install java-1.8.0-openjdk
3. cmake 3.16 or higher version
4. maven 3.6 or higher version
@@ -117,7 +117,7 @@ You have to use a customized Arrow to support our datasets Java API.

```
// build arrow-cpp
-git clone -b arrow-4.0.0-oap https://github.com/oap-project/arrow.git
+git clone -b arrow-4.0.0-oap-1.4 https://github.com/oap-project/arrow.git
cd arrow/cpp
mkdir build
cd build
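# NOTE: the cmake invocation below this point is truncated in the rendered
# diff. A hedged sketch of the typical completion, assuming the same flags
# used by the workflow files updated in this commit:
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON \
  -DARROW_PARQUET=ON -DARROW_ORC=ON -DARROW_CSV=ON -DARROW_HDFS=ON \
  -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON \
  -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON
make -j$(nproc)
sudo make install
```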
2 changes: 1 addition & 1 deletion arrow-data-source/script/build_arrow.sh
@@ -62,7 +62,7 @@ echo "ARROW_SOURCE_DIR=${ARROW_SOURCE_DIR}"
echo "ARROW_INSTALL_DIR=${ARROW_INSTALL_DIR}"
mkdir -p $ARROW_SOURCE_DIR
mkdir -p $ARROW_INSTALL_DIR
-git clone https://github.com/oap-project/arrow.git --branch arrow-4.0.0-oap $ARROW_SOURCE_DIR
+git clone https://github.com/oap-project/arrow.git --branch arrow-4.0.0-oap-1.4 $ARROW_SOURCE_DIR
pushd $ARROW_SOURCE_DIR

cmake ./cpp \
2 changes: 1 addition & 1 deletion docs/ApacheArrowInstallation.md
@@ -30,7 +30,7 @@ Please make sure your cmake version is qualified based on the prerequisite.
# Arrow
``` shell
git clone https://github.com/oap-project/arrow.git
-cd arrow && git checkout arrow-4.0.0-oap
+cd arrow && git checkout arrow-4.0.0-oap-1.4
mkdir -p arrow/cpp/release-build
cd arrow/cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_ORC=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
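# The diff is truncated here; the usual next steps (an assumption, mirroring
# the workflow files updated in this commit) are:
make -j$(nproc)
sudo make install
```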
2 changes: 1 addition & 1 deletion docs/Installation.md
@@ -31,7 +31,7 @@ Based on the different environments, there are some parameters that can be set via -D
| arrow_root | When build_arrow set to False, arrow_root will be enabled to find the location of your existing arrow library. | /usr/local |
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |

-When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
+When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap-1.4)
If you wish to change any Arrow parameters, you can modify them in the `build_arrow.sh` script under `gazelle_plugin/arrow-data-source/script/`.
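As a sketch of how these parameters are typically passed (the Maven goals shown are assumptions; only the `-D` names come from the table above):

```bash
# Build Gazelle and let build_arrow.sh compile the custom Arrow library.
mvn clean package -DskipTests -Dbuild_arrow=True -Dbuild_protobuf=True

# Or reuse an existing Arrow installation instead of rebuilding it.
mvn clean package -DskipTests -Dbuild_arrow=False -Darrow_root=/usr/local
```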

### Additional Notes
10 changes: 5 additions & 5 deletions docs/OAP-Developer-Guide.md
@@ -4,8 +4,8 @@ This document contains the instructions & scripts on installing necessary depend
You can get more detailed information from OAP each module below.


-* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.3.1)
-* [Gazelle Plugin](https://github.com/oap-project/gazelle_plugin/tree/v1.3.1)
+* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.4.0)
+* [Gazelle Plugin](https://github.com/oap-project/gazelle_plugin/tree/v1.4.0)

## Building OAP

@@ -18,14 +18,14 @@ We provide scripts to help automatically install dependencies required, please c
# cd oap-tools
# sh dev/install-compile-time-dependencies.sh
```
-*Note*: oap-tools tag version `v1.3.1` corresponds to all OAP modules' tag version `v1.3.1`.
+*Note*: oap-tools tag version `v1.4.0` corresponds to all OAP modules' tag version `v1.4.0`.

Then the dependencies below will be installed:

* [Cmake](https://cmake.org/install/)
-* [GCC > 7](https://gcc.gnu.org/wiki/InstallingGCC)
+* [GCC > 9](https://gcc.gnu.org/wiki/InstallingGCC)
* [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
-* [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.3.1)
+* [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.4.0)
* [LLVM](https://llvm.org/)


4 changes: 2 additions & 2 deletions docs/OAP-Installation-Guide.md
@@ -11,7 +11,7 @@ Follow steps below on ***every node*** of your cluster to set right environment
### Prerequisites

- **OS Requirements**
-We have tested OAP on Fedora 29 and CentOS 7.6 (kernel-4.18.16). We recommend you use **Fedora 29 CentOS 7.6 or above**. Besides, for [Memkind](https://github.com/memkind/memkind/tree/v1.10.1-rc2) we recommend you use **kernel above 3.10**.
+We have tested OAP on Fedora 29, CentOS 7.6 (kernel 4.18.16) and Ubuntu 20.04 (kernel 5.4.0-65-generic).

- **Conda Requirements**
Install Conda on your cluster nodes with the commands below and follow the prompts on the installer screens:
@@ -28,7 +28,7 @@ To test your installation, run the command `conda list` in your terminal window
Create a Conda environment and install OAP Conda package.

```bash
-$ conda create -n oapenv -c conda-forge -c intel -y oap=1.3.1
+$ conda create -n oapenv -c conda-forge -c intel -y oap=1.4.0.spark32
```

Once you have finished the steps above, you have completed the OAP dependency installation and build, and will find the built OAP jars under `$HOME/miniconda2/envs/oapenv/oap_jars`.
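To sanity-check the result (a minimal sketch; the env name and jar directory follow from the commands above):

```bash
# Activate the environment created above and list the installed OAP jars.
conda activate oapenv
ls $HOME/miniconda2/envs/oapenv/oap_jars
```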
24 changes: 11 additions & 13 deletions docs/Prerequisite.md
@@ -3,7 +3,7 @@
There are some requirements before you build the project.
Please make sure you have already installed the software in your system.

-1. GCC 7.0 or higher version
+1. GCC 9.0 or higher version
2. LLVM 7.0 or higher version
3. java8 OpenJDK -> yum install java-1.8.0-openjdk
4. cmake 3.16 or higher version
@@ -12,38 +12,36 @@ Please make sure you have already installed the software in your system.
7. Spark 3.1.x or Spark 3.2.x
8. Intel Optimized Arrow 4.0.0

-## gcc installation
+## GCC installation

-// installing GCC 7.0 or higher version
+Please install GCC 9.0 or higher version.

-Please notes for better performance support, GCC 7.0 is a minimal requirement with Intel Microarchitecture such as SKYLAKE, CASCADELAKE, ICELAKE.
+Please note that for better performance, GCC 9.0 is the minimum requirement on Intel microarchitectures such as SKYLAKE, CASCADELAKE, and ICELAKE.
https://gcc.gnu.org/install/index.html

-Follow the above website to download gcc.
-C++ library may ask a certain version, if you are using GCC 7.0 the version would be libstdc++.so.6.0.28.
-You may have to launch ./contrib/download_prerequisites command to install all the prerequisites for gcc.
+Follow the above website to download GCC.
If you are facing a downloading issue in the download_prerequisites command, you can try to change ftp to http.

-//Follow the steps to configure gcc
+Follow the steps to configure GCC
https://gcc.gnu.org/install/configure.html

If you are facing a multilib issue, you can try to add the --disable-multilib parameter to ../configure

-//Follow the steps to build gc
+Follow the steps to build GCC
https://gcc.gnu.org/install/build.html

-//Follow the steps to install gcc
+Follow the steps to install GCC
https://gcc.gnu.org/install/finalinstall.html
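Taken together, the configure/build/install steps above typically look like the following (a sketch only; the GCC version, mirror URL, and install prefix are assumptions):

```bash
# Out-of-tree GCC build; adjust the version and $YOUR_GCC_INSTALLATION_DIR.
wget https://ftp.gnu.org/gnu/gcc/gcc-9.4.0/gcc-9.4.0.tar.gz
tar xf gcc-9.4.0.tar.gz
cd gcc-9.4.0 && ./contrib/download_prerequisites && cd ..
mkdir gcc-build && cd gcc-build
../gcc-9.4.0/configure --prefix=$YOUR_GCC_INSTALLATION_DIR --disable-multilib
make -j$(nproc)
make install
```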

-//Set up Environment for new gcc
+Set up environment for new GCC
```
export PATH=$YOUR_GCC_INSTALLATION_DIR/bin:$PATH
export LD_LIBRARY_PATH=$YOUR_GCC_INSTALLATION_DIR/lib64:$LD_LIBRARY_PATH
```
Please remember to add and source these settings in your environment files such as /etc/profile or /etc/bashrc

-//Verify if gcc has been installation
-Use `gcc -v` command to verify if your gcc version is correct.(Must larger than 7.0)
+Verify that GCC has been installed.
+Use the `gcc -v` command to verify that your GCC version is correct (must be 9.0 or higher).

## LLVM 7.0 installation

4 changes: 2 additions & 2 deletions docs/User-Guide.md
Expand Up @@ -24,8 +24,8 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) its possib

![Overview](./image/dataset.png)

-A native parquet reader was developed to speed up the data loading. it's based on Apache Arrow Dataset. For details please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source)
-Note both data source V1 and V2 are supported. Please check the [example section](../arrow-data-source/#run-a-query-with-arrowdatasource-scala) for arrow data source
+A native Parquet reader was developed to speed up data loading. It's based on the Apache Arrow Dataset API. For details please check [Arrow Data Source](https://github.com/oap-project/gazelle_plugin/tree/main/arrow-data-source).
+Note that both data source V1 and V2 are supported. Please check the [example section](https://github.com/oap-project/gazelle_plugin/tree/main/arrow-data-source#run-a-query-with-arrowdatasource-scala) for the Arrow data source.

### Apache Arrow Compute/Gandiva based operators

44 changes: 44 additions & 0 deletions docs/index.md
@@ -0,0 +1,44 @@
# Gazelle Plugin

A Native Engine for Spark SQL with vectorized SIMD optimizations. Please refer to [user guide](./User-Guide.md) for details on how to enable Gazelle.

## Introduction

![Overview](./image/nativesql_arch.png)

Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance by generating Java JIT code. However, Java JIT usually does not work well at utilizing the latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine, Gandiva, are also very efficient.

Gazelle Plugin reimplements Spark SQL execution layer with SIMD-friendly columnar data processing based on Apache Arrow,
and leverages Arrow's CPU-cache friendly columnar in-memory layout, SIMD-optimized kernels and LLVM-based expression engine to bring better performance to Spark SQL.
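
As a minimal sketch of what enabling the plugin looks like in practice (the plugin class, jar path, and off-heap sizing below are assumptions; see the [user guide](./User-Guide.md) for the authoritative configuration):

```bash
# Hypothetical launch: the jar path and settings must be adapted to your build.
${SPARK_HOME}/bin/spark-shell \
  --conf spark.plugins=com.intel.oap.GazellePlugin \
  --conf spark.driver.extraClassPath=${GAZELLE_JAR} \
  --conf spark.executor.extraClassPath=${GAZELLE_JAR} \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g
```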

## Performance data

For advanced performance testing, the charts below show results from two benchmarks run with Gazelle v1.1: Decision Support Benchmark1 and Decision Support Benchmark2.
The testing environment for Decision Support Benchmark1 uses 1 master + 3 workers, each node with an Intel(R) Xeon(R) Gold 6252 CPU, 384GB memory, and 3x NVMe SSD, on a 1.5TB dataset in Parquet format.
* Decision Support Benchmark1 is a query set modified from the [TPC-H benchmark](http://tpc.org/tpch/default5.asp). We changed Decimal to Double since Decimal was not yet supported in OAP v1.0 Gazelle Plugin.
Overall, the results show a 1.49X performance speedup for OAP v1.0 Gazelle Plugin compared to vanilla Spark 3.0.0.
We also provide detailed results by query; most queries in Decision Support Benchmark1 benefit from Gazelle Plugin. The performance boost may depend on the individual query.

![Performance](./image/decision_support_bench1_result_in_total_v1.1.png)

![Performance](./image/decision_support_bench1_result_by_query_v1.1.png)

The testing environment for Decision Support Benchmark2 uses 1 master + 3 workers, each node with an Intel(R) Xeon(R) Platinum 8360Y CPU, 1440GB memory, and 4x NVMe SSD, on a 3TB dataset in Parquet format.
* Decision Support Benchmark2 is a query set modified from the [TPC-DS benchmark](http://tpc.org/tpcds/default5.asp). We changed Decimal to Double since Decimal was not yet supported in OAP v1.0 Gazelle Plugin.
We picked 10 queries that are fully supported in OAP v1.0 Gazelle Plugin; the results show a 1.26X performance speedup compared to vanilla Spark 3.0.0.

![Performance](./image/decision_support_bench2_result_in_total_v1.1.png)

Please note that the performance data are not official TPC-H or TPC-DS results. Actual performance may vary by workload. Please try your workloads with Gazelle Plugin first and check the DAG or log files to see whether all operators can be supported in OAP Gazelle Plugin. Please check the [detailed page](./performance.md) on performance tuning for TPC-H and TPC-DS workloads.


## Coding Style

* For Java code, we use [google-java-format](https://github.com/google/google-java-format)
* For Scala code, we use [Spark Scala Format](https://github.com/apache/spark/blob/master/dev/.scalafmt.conf); please use [scalafmt](https://github.com/scalameta/scalafmt) or run ./scalafmt to format Scala code
* For C++ code, we use clang-format; see [google-vim-codefmt](https://github.com/google/vim-codefmt) for details

## Contact

[email protected]
[email protected]
25 changes: 14 additions & 11 deletions mkdocs.yml
@@ -6,16 +6,19 @@ edit_uri: ""


nav:
-  - User Guide: User-Guide.md
-  - Prerequisite: Prerequisite.md
-  - Installation: Installation.md
-  - InstallationNotes: InstallationNotes.md
-  - Configuration: Configuration.md
-  - SparkInstallation: SparkInstallation.md
-  - ApacheArrowInstallation: ApacheArrowInstallation.md
-  - OAP Installation Guide: OAP-Installation-Guide.md
-  - OAP Developer Guide: OAP-Developer-Guide.md
-  - Version Selector: "../"
+  - User Guide: User-Guide.md
+  - Prerequisite: Prerequisite.md
+  - Installation: Installation.md
+  - InstallationNotes: InstallationNotes.md
+  - Configuration: Configuration.md
+  - SparkInstallation: SparkInstallation.md
+  - ApacheArrowInstallation: ApacheArrowInstallation.md
+  - Columnar UDF Development: Columnar-UDF-Development-Guide.md
+  - Memory Allocation in Gazelle Plugin: memory.md
+  - Performance Tuning for Gazelle Plugin: performance.md
+  - OAP Installation Guide: OAP-Installation-Guide.md
+  - OAP Developer Guide: OAP-Developer-Guide.md
+  - Version Selector: "../"



@@ -25,5 +28,5 @@ theme: readthedocs
plugins:
- search
- mkdocs-versioning:
-    version: 1.3.1
+    version: 1.4.0
exclude_from_nav: ["image", "js", "css", "fonts", "img"]
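
For doc contributors, a quick local preview of this configuration (a sketch; assumes Python and pip are available) is:

```bash
# Install mkdocs plus the versioning plugin used above, then serve locally.
pip install mkdocs mkdocs-versioning
mkdocs serve
```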
