gzip pre-decompress w/IAA #6176
Conversation
@Yuhta, we updated the PR again; it shows an observable gain in a Spark/Gluten/Velox environment on a couple of TPC-H queries, e.g., Q15 and Q14. Could you help review it? Thanks very much!
Hi @george-gu-2021, can you add more context on the performance benefits you're seeing in your benchmarks? If there is a place where I can read more, please send us the link.
Hi @pedroerp, this PR implements "Velox Parquet scan acceleration with Intel IAA (In-Memory Analytics Accelerator)", which leverages the on-die accelerator to offload gzip decompression in the Velox Parquet scan. The overall design and PR context are described in issue #5718. In short, some queries' performance can improve by up to 40% compared with a Parquet scan that uses zstd software decompression. The latest update matters for Q15 and Q14 because it fixes a skip-page call flow that was missing in the initial version of the PR. Hi @yaqi-zhao, feel free to correct me or add more info if anything is missing or incorrect. Thanks!
@george-gu-2021 thank you for the context. From Kelly's presentation at the monthly OSS meeting, my understanding was that IAA only supported compression (hence why we were evaluating and considering it for table writes). Does it actually support decompression as well, or is it a different technology? Cc: @mbasmanova
@FelixYBW @pedroerp @george-gu-2021 I have created a discussion (#7445). I added a solution introduction and a duplicated-code analysis based on the current PR. Please add your insights to the discussion. Thanks a lot!
@yaqi-zhao you may hold off on this PR. Rong is creating the unified compression codec API, including sync and async. Let's finish that PR first.
@FelixYBW I can see that a 4KB zlib window size is not a standard setup for Parquet files. So even if we add it to table scan, there are no existing files we can read that would benefit from it. Would it make more sense to add it to shuffle first to see some real-world benefits?
Hi @Yuhta, the 4KB window size is a parameter that Arrow exposes in its interfaces; users can set it per their preference when generating Parquet files, typically in ETL processing. Some partners are open to configuring it. In our current validation stage, we generated some 4KB-window zlib Parquet streams with Velox (including the Arrow module), and we are happy to share the generation process and sample Parquet streams if anyone needs them. Thanks!
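For context (not part of the PR itself): at the zlib level the history window is controlled by the `windowBits` argument, where a value of 12 gives a 2^12 = 4KB window and the default of 15 gives 32KB. A minimal standalone sketch of producing such a stream with plain zlib, assuming nothing about the Arrow/Velox writer path:

```cpp
#include <zlib.h>

#include <cstring>
#include <string>
#include <vector>

// Produce a DEFLATE stream whose history window is only 4KB.
// windowBits = 12 means a 2^12 = 4096-byte window; the zlib default is 15 (32KB).
// Adding 16 to windowBits (i.e. 28) would emit a gzip wrapper instead of a zlib one.
std::vector<unsigned char> deflateWith4KWindow(const std::string& input) {
  z_stream zs;
  std::memset(&zs, 0, sizeof(zs));
  if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                   /*windowBits=*/12, /*memLevel=*/8,
                   Z_DEFAULT_STRATEGY) != Z_OK) {
    return {};
  }
  std::vector<unsigned char> out(deflateBound(&zs, input.size()));
  zs.next_in = reinterpret_cast<Bytef*>(const_cast<char*>(input.data()));
  zs.avail_in = static_cast<uInt>(input.size());
  zs.next_out = out.data();
  zs.avail_out = static_cast<uInt>(out.size());
  int rc = deflate(&zs, Z_FINISH);  // single-shot compression of the whole buffer
  deflateEnd(&zs);
  if (rc != Z_STREAM_END) {
    return {};
  }
  out.resize(zs.total_out);
  return out;
}
```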
// 'rowOfPage_' is the row number of the first row of the next page.
this->rowOfPage_ += this->numRowsInPage_;

if (seekToPreDecompPage(row)) {
Maybe create some virtual hooks in the base class for these calls; then you don't need to duplicate the other parts in subclasses.
Yes, I have thought over this solution, but it would require a lot of code changes in the current PageReader. Do you think that is reasonable?
Why would there be a lot more change? You just need to add a few virtual functions with empty implementations and invoke them in the places needed. No existing logic will be touched.
}
this->updateRowInfoAfterPageSkipped();
}
if (isWinSizeFit) { |
Hook
return;
}
if (job_success) { |
Hook
BufferPtr uncompressedData;
};
class IAAPageReader : public PageReader { |
Seems there is no need for PageReaderBase if you are inheriting from the concrete PageReader.
The reason to create PageReaderBase is to realize polymorphism, so that when ParquetData calls the same PageReaderBase function, it behaves differently in different scenarios.
Reading the code, I think we can do it a little differently. What you really need in IAAPageReader is a set of extension points (hooks) where you run extra code in addition to the basic PageReader. So you can add a few virtual methods in PageReader, defaulting to no-ops, and call them in the expected places. Then in IAAPageReader you add the implementations for these hooks. Does that sound good to you?
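For illustration, the hook approach described above could look roughly like the sketch below. The hook names (`onPageSkipped`, `onBeforeDecompress`) and the surrounding method bodies are hypothetical placeholders, not identifiers from this PR:

```cpp
// Sketch of the suggested extension points. Hook names are hypothetical.
class PageReader {
 public:
  virtual ~PageReader() = default;

  void skipToPage(int64_t row) {
    // ... existing page-skip logic ...
    onPageSkipped(row);  // no-op by default; the IAA subclass adds extra work here
    // ... existing bookkeeping (e.g. updateRowInfoAfterPageSkipped) ...
  }

 protected:
  // Default hooks do nothing, so existing PageReader behavior is unchanged.
  virtual void onPageSkipped(int64_t /*row*/) {}
  virtual void onBeforeDecompress() {}
};

class IAAPageReader : public PageReader {
 protected:
  void onPageSkipped(int64_t row) override {
    // e.g. seek to the pre-decompressed page produced by the IAA job.
  }
  void onBeforeDecompress() override {
    // e.g. check whether the asynchronous IAA job finished successfully.
  }
};
```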
dictionaryEncoding_ == Encoding::PLAIN);

if (codec_ != thrift::CompressionCodec::UNCOMPRESSED) {
if (job_success) { |
Hook
Recently, most SW stacks use the default setting (32KB), except for some scenarios that purposely set a 4KB history buffer to cope with memory capacity constraints and avoid OOM or spill operations. Regarding HW capability, the current generation of IAA also has a 4KB history buffer limitation. That is why we propose adding this logic: it opens an option for users to leverage the IAA HW where it is applicable. Thanks! @Yuhta
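To illustrate that history-buffer constraint on the read side (a sketch with plain zlib, independent of the IAA code path): zlib's `inflateInit2()` requires its `windowBits` to be at least the value used at compression time, so a decompressor limited to a 4KB history can only handle streams written with `windowBits` of 12 or less.

```cpp
#include <zlib.h>

#include <cstring>
#include <vector>

// Attempt to inflate 'src' using only a 4KB history window.
// Per the zlib manual, windowBits here must be >= the windowBits used at
// compression time; otherwise inflate() reports Z_DATA_ERROR. This mirrors
// the constraint a 4KB-history decompressor operates under.
// 'dst' must be pre-sized to the expected uncompressed length
// (for Parquet, known from the page header).
bool inflateWith4KWindow(const std::vector<unsigned char>& src,
                         std::vector<unsigned char>& dst) {
  z_stream zs;
  std::memset(&zs, 0, sizeof(zs));
  if (inflateInit2(&zs, /*windowBits=*/12) != Z_OK) {  // 2^12 = 4KB window
    return false;
  }
  zs.next_in = const_cast<Bytef*>(src.data());
  zs.avail_in = static_cast<uInt>(src.size());
  zs.next_out = dst.data();
  zs.avail_out = static_cast<uInt>(dst.size());
  int rc = inflate(&zs, Z_FINISH);  // Z_DATA_ERROR if the stream needs a larger window
  inflateEnd(&zs);
  return rc == Z_STREAM_END;
}
```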
We already added it to Gluten's shuffle and are working on spill through the unified compression codec API in this PR: #7589
Hi @FelixYBW, partners will like the feature! Per our communication with them, they are highly interested in leveraging Gluten/Velox to conduct ETL and generate Parquet data once the feature is ready.
There should not be any major change in the Parquet folder, but let's wait for #7471 to be merged first.
@@ -116,6 +116,7 @@ class ReaderBase {
std::shared_ptr<const dwio::common::TypeWithId> schemaWithId_;

const bool binaryAsString = false;
bool needPreDecomp = true; |
These should be in dwio::common::compression
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
The Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator that provides very high-throughput compression and decompression combined with primitive analytic functions. It is available in the newest generation of Intel® Xeon® Scalable processors ("Sapphire Rapids"). We can offload GZip decompression (with a 4KB window size) to the IAA hardware and save CPU bandwidth. A description of how to offload GZip decompression to the IAA hardware is in #5718.
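For readers who want a feel for how the offload is driven in software: IAA is programmed through Intel's Query Processing Library (QPL), where work is described as a job structure and submitted to the accelerator. The snippet below is a hedged, minimal sketch of a synchronous DEFLATE decompress job using the public QPL C API; the PR's actual integration sits behind Velox's compression/codec layer and also handles the asynchronous pre-decompression flow, and exact names may vary across QPL releases.

```cpp
#include <qpl/qpl.h>

#include <cstdint>
#include <vector>

// Decompress a raw DEFLATE stream on IAA via QPL (synchronous path).
// A real integration would fall back to the software zlib path when the
// hardware path is unavailable, and would use qpl_submit_job()/qpl_wait_job()
// for the asynchronous pre-decompression flow referenced in this PR.
bool iaaDecompress(const uint8_t* src, uint32_t srcSize,
                   uint8_t* dst, uint32_t dstCapacity,
                   uint32_t& dstSize) {
  uint32_t jobSize = 0;
  if (qpl_get_job_size(qpl_path_hardware, &jobSize) != QPL_STS_OK) {
    return false;
  }
  std::vector<uint8_t> jobBuffer(jobSize);
  qpl_job* job = reinterpret_cast<qpl_job*>(jobBuffer.data());
  if (qpl_init_job(qpl_path_hardware, job) != QPL_STS_OK) {
    return false;
  }
  job->op = qpl_op_decompress;
  job->next_in_ptr = const_cast<uint8_t*>(src);
  job->available_in = srcSize;
  job->next_out_ptr = dst;
  job->available_out = dstCapacity;
  job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST;  // whole stream in one job
  qpl_status status = qpl_execute_job(job);
  dstSize = job->total_out;
  qpl_fini_job(job);
  return status == QPL_STS_OK;
}
```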