Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gzip pre-decompress w/IAA #6176

Closed
wants to merge 2 commits into from

Conversation

yaqi-zhao
Copy link
Contributor

The Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator that provides very high
throughput compression and decompression combined with primitive analytic functions. It is available in the newest generation of Intel® Xeon® Scalable processors ("Sapphire Rapids").
We can offload the GZip (window size is 4KB) decompression to the IAA hardware and save the CPU bandwidth. Here is a description of how to offload the GZip decompression to the IAA hardware

#5718

@netlify
Copy link

netlify bot commented Aug 21, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 4ab13b3
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65c334d733aadb00085fceb0

@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 2 times, most recently from 325e9c1 to 1f15edb Compare August 21, 2023 03:37
velox/dwio/parquet/reader/IAAPageReader.cpp Show resolved Hide resolved
velox/dwio/parquet/reader/ParquetData.cpp Outdated Show resolved Hide resolved
velox/dwio/common/QplJobPool.h Outdated Show resolved Hide resolved
velox/dwio/common/QplJobPool.cpp Outdated Show resolved Hide resolved
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 21, 2023
@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 2 times, most recently from 74eab9d to 40c9a62 Compare August 24, 2023 08:36
@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 3 times, most recently from 6ff1026 to 842de1b Compare August 24, 2023 09:05
@Yuhta Yuhta self-requested a review August 24, 2023 15:25
@yaqi-zhao
Copy link
Contributor Author

Hi, @Yuhta. Thanks for your comments. I rebase my work on #5914. IAAPageReader is deleted and the new logic is implemented inside the PageReader according to your suggestion. Can you continue the PR review?

@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 3 times, most recently from 5b84c3f to ca13be9 Compare September 27, 2023 07:54
@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 7 times, most recently from 6e20474 to c042e46 Compare October 26, 2023 05:47
@george-gu-2021
Copy link

@Yuhta , we updated the PR again which proves observable gain in Spark/Gluten/Velox environment w/ a couple of TPC-H queries, eg, Q15, Q14. May you help review it? Thanks very much!

@pedroerp
Copy link
Contributor

pedroerp commented Nov 3, 2023

@Yuhta , we updated the PR again which proves observable gain in Spark/Gluten/Velox environment w/ a couple of TPC-H queries, eg, Q15, Q14. May you help review it? Thanks very much!

HI @george-gu-2021 can you add more context on the performance benefits you're seeing in your benchmarks? If there is a place where I can read more, please send us the link

@george-gu-2021
Copy link

@Yuhta , we updated the PR again which proves observable gain in Spark/Gluten/Velox environment w/ a couple of TPC-H queries, eg, Q15, Q14. May you help review it? Thanks very much!

HI @george-gu-2021 can you add more context on the performance benefits you're seeing in your benchmarks? If there is a place where I can read more, please send us the link

Hi @pedroerp , the PR is to implement "Velox parquet scan acceleration w/ Intel IAA (in memory acceleration)" which leverages on-die accelerator to offload gzip decompression in Velox parquet scan. Its overall design and PR context is illustrated in the issue: #5718 . In short, some queries perf can boost up to 40% against the query perf on zstd SW decompression based parquet scan.

Regarding the latest update for Q15 and Q14, it is because it fix a skip-page call flow which is missing in the PR initial version.

Hi @yaqi-zhao , be free to correct me or add more info if anything missing or incorrect. Thanks!

@pedroerp
Copy link
Contributor

pedroerp commented Nov 4, 2023

@george-gu-2021 thank you for the context. From Kelly's presentation on the monthly OSS meeting my understanding was that IAA only supported compression (hence why we were evaluating and considering it for table writes). Does it actually support compression as well, or is it a different technology?

Cc: @mbasmanova

CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
third_party/CMakeLists.txt Outdated Show resolved Hide resolved
velox/dwio/common/CMakeLists.txt Outdated Show resolved Hide resolved
velox/dwio/common/CMakeLists.txt Outdated Show resolved Hide resolved
velox/dwio/common/QplJobPool.cpp Outdated Show resolved Hide resolved
velox/dwio/common/QplJobPool.cpp Show resolved Hide resolved
velox/dwio/common/QplJobPool.cpp Outdated Show resolved Hide resolved
velox/dwio/common/QplJobPool.h Outdated Show resolved Hide resolved
velox/dwio/common/QplJobPool.h Show resolved Hide resolved
@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 3 times, most recently from 9172107 to ef56608 Compare November 7, 2023 08:42
@yaqi-zhao
Copy link
Contributor Author

This would be my preference. Can we start a discussion with some thoughts on how this API could be built to hide the accelerator complexity from the library (parquet reader/writer, shuffle operator, etc). I'm concerned about polluting the codebase with too many accelerator details, making it more complex and error prone.
Happy to help iterate on that design.

That's why I 'd comment to remove all accelerator related logic from s/w path. @yaqi-zhao can you have a design and start a discussion?

@FelixYBW @pedroerp @george-gu-2021 I have created a discussion(#7445) . I add solution introduction and duplicated code analysis based on current PR. Please add your insights on this discussion. Thanks a lot!

@FelixYBW
Copy link

@yaqi-zhao you may hold on the PR. Rong is creating the unified compression codec API, including sync and async. Let's finish the PR firstly.

@Yuhta
Copy link
Contributor

Yuhta commented Nov 20, 2023

Generally both QAT and IAA can be a clean codec library to be added to de-/compression. It can be reused in parquet reader/scan, shuffle and spill. Currently we already enabled them in Gluten's shuffle.

@FelixYBW I can see that zlib window size 4KB is not a standard setup for parquet files. So even we add it to table scan, there is no file we can read to benefit from. Would it make more sense to add to shuffle first to see some real world benefits?

@george-gu-2021
Copy link

Generally both QAT and IAA can be a clean codec library to be added to de-/compression. It can be reused in parquet reader/scan, shuffle and spill. Currently we already enabled them in Gluten's shuffle.

@FelixYBW I can see that zlib window size 4KB is not a standard setup for parquet files. So even we add it to table scan, there is no file we can read to benefit from. Would it make more sense to add to shuffle first to see some real world benefits?

Hi @Yuhta , window size 4KB is a parameter that Arrow exposes in its interfaces and users can set that per their preference while generating parquet files typically in ETL processing. Some partners are open to config that. In our current validation stage, we generate some 4KB zlib parquet stream with Velox (including Arrow module) and we are happy to share the generation process and sample parquet streams if anyone needs those. Thanks!

@Yuhta
Copy link
Contributor

Yuhta commented Nov 21, 2023

Hi @Yuhta , window size 4KB is a parameter that Arrow exposes in its interfaces and users can set that per their preference while generating parquet files typically in ETL processing. Some partners are open to config that. In our current validation stage, we generate some 4KB zlib parquet stream with Velox (including Arrow module) and we are happy to share the generation process and sample parquet streams if anyone needs those. Thanks!

But that information is not written to the parquet file so if the file is not written by ourselves there is no way to get it. Most of the other writers just write with the default window size of 32KB. That forces us to use the default window size of 32KB most of the time. Would be nice if IAA/QAT can support 32 KB window size. Is it an hardware restriction?

// 'rowOfPage_' is the row number of the first row of the next page.
this->rowOfPage_ += this->numRowsInPage_;

if (seekToPreDecompPage(row)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe create some virtual hooks in the base class for these calls, then you don't need to duplicate the other part in subclasses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have thought over this solution, but there will be a lot of code changes in the current PageReader, do you think it is reasonable?

Copy link
Contributor

@Yuhta Yuhta Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there will be a lot of more change? You just need to add a few more virtual functions with empty implementation and invoke them in the places needed. No existing logic will be touched.

velox/dwio/parquet/reader/IAAPageReader.h Outdated Show resolved Hide resolved
}
this->updateRowInfoAfterPageSkipped();
}
if (isWinSizeFit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hook


return;
}
if (job_success) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hook

BufferPtr uncompressedData;
};

class IAAPageReader : public PageReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems no need for PageReaderBase if you are inheriting from the concrete PageReader

Copy link
Contributor Author

@yaqi-zhao yaqi-zhao Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to create PageReaderBase is to realize Polymorphism, so that when ParquetData call the same PageReaderBase function behaves differently in different scenarios

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading from the code I think we can do it a little bit differently. What you really need in IAAPageReader is a set of extension points (hooks) that you run extra code, in addition to the basic PageReader. So you can add a few virtual methods in PageReader, default to no-op, call them in the expected places. Then in IAAPageReader you add the implementations for these hooks. Does it sound good to you?

dictionaryEncoding_ == Encoding::PLAIN);

if (codec_ != thrift::CompressionCodec::UNCOMPRESSED) {
if (job_success) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hook

@george-gu-2021
Copy link

Hi @Yuhta , window size 4KB is a parameter that Arrow exposes in its interfaces and users can set that per their preference while generating parquet files typically in ETL processing. Some partners are open to config that. In our current validation stage, we generate some 4KB zlib parquet stream with Velox (including Arrow module) and we are happy to share the generation process and sample parquet streams if anyone needs those. Thanks!

But that information is not written to the parquet file so if the file is not written by ourselves there is no way to get it. Most of the other writers just write with the default window size of 32KB. That forces us to use the default window size of 32KB most of the time. Would be nice if IAA/QAT can support 32 KB window size. Is it an hardware restriction?

Recently most SW stacks use the default setting (32KB) except some scenarios need to set "4KB" history buffer purposely to take care of memory capacity constraints to avoid OOM or spill operations. Regarding HW capability, the current generation IAA does have 4KB history buffer limitation as well. That is why we propose to add the logic to open the option for users to leverage IAA HW where it is applicable. Thanks! @Yuhta

@FelixYBW
Copy link

Generally both QAT and IAA can be a clean codec library to be added to de-/compression. It can be reused in parquet reader/scan, shuffle and spill. Currently we already enabled them in Gluten's shuffle.

@FelixYBW I can see that zlib window size 4KB is not a standard setup for parquet files. So even we add it to table scan, there is no file we can read to benefit from. Would it make more sense to add to shuffle first to see some real world benefits?

We already added it to Gluten's shuffle and are working on Spill through the unified Compression API codec in this PR: #7589

@george-gu-2021
Copy link

Generally both QAT and IAA can be a clean codec library to be added to de-/compression. It can be reused in parquet reader/scan, shuffle and spill. Currently we already enabled them in Gluten's shuffle.

@FelixYBW I can see that zlib window size 4KB is not a standard setup for parquet files. So even we add it to table scan, there is no file we can read to benefit from. Would it make more sense to add to shuffle first to see some real world benefits?

We already added it to Gluten's shuffle and are working on Spill through the unified Compression API codec in this PR: #7589

Hi @FelixYBW , partners will like the feature! Per the communication w/ them, they are highly interested in leveraging Gluten/Velox to conduct ETL and generate parquet data once feature is ready.

Copy link
Collaborator

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not be any major change in the Parquet folder. but, let's wait for #7471 to be merged first.

@@ -116,6 +116,7 @@ class ReaderBase {
std::shared_ptr<const dwio::common::TypeWithId> schemaWithId_;

const bool binaryAsString = false;
bool needPreDecomp = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be in dwio::common::compression

@yaqi-zhao yaqi-zhao force-pushed the merge_branch_1 branch 2 times, most recently from b3ca532 to 4ab13b3 Compare February 7, 2024 07:44
Copy link

stale bot commented May 8, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label May 8, 2024
@stale stale bot closed this May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.