[core][format] Optimize manifest reading performance, add pushdown for manifest and orc. #4497
…ure pushdown works
…on vectors would not work
with #4231 together |
paimon-common/src/main/java/org/apache/paimon/format/FileFormat.java
paimon-common/src/main/java/org/apache/paimon/fs/ObjectCacheManager.java
paimon-core/src/main/java/org/apache/paimon/manifest/ManifestFile.java
paimon-core/src/main/java/org/apache/paimon/operation/AbstractFileStoreScan.java
paimon-core/src/main/java/org/apache/paimon/table/source/InnerTableScan.java
Note: The cache code for manifest and file format has been withdrawn from this PR and will be submitted in a follow-up PR, so this PR alone does not yet achieve the performance described under Purpose.
return Optional.empty();
FileFormatFactory fileFormatFactory =
        FactoryUtil.discoverFactory(
Can you just create a PR for FileFormatFactory?
@JingsongLi Sure, but I'll make that change in a few days. I've been a little busy lately.
 * Read the corresponding entries based on the current required bucket, but push down into file
 * format .
 */
private static List<Predicate> createPushDownFilter(Collection<Integer> buckets) {
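To make the idea behind createPushDownFilter concrete, here is a minimal, self-contained sketch in plain Java. It deliberately avoids Paimon's Predicate and file format classes: Entry, bucketFilter, and read are hypothetical names, and the method in this PR returns a List<Predicate> for the format reader rather than an IntPredicate.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

// Sketch only: hypothetical stand-ins, not Paimon's Predicate or reader API.
class BucketPushDownSketch {

    /** Hypothetical manifest entry; only the bucket id matters for this sketch. */
    static final class Entry {
        final int bucket;
        final String fileName;

        Entry(int bucket, String fileName) {
            this.bucket = bucket;
            this.fileName = fileName;
        }
    }

    /** Returns a filter that keeps only the requested buckets; null or empty keeps everything. */
    static IntPredicate bucketFilter(Collection<Integer> buckets) {
        if (buckets == null || buckets.isEmpty()) {
            return bucket -> true;
        }
        Set<Integer> wanted = new HashSet<>(buckets);
        return bucket -> wanted.contains(bucket);
    }

    /** Simulates a reader that applies the filter while decoding, so rejected entries are never returned. */
    static List<Entry> read(List<Entry> fileContents, IntPredicate filter) {
        List<Entry> result = new ArrayList<>();
        for (Entry entry : fileContents) {
            if (filter.test(entry.bucket)) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> manifest = List.of(
                new Entry(0, "data-0.orc"),
                new Entry(1, "data-1.orc"),
                new Entry(2, "data-2.orc"),
                new Entry(1, "data-3.orc"));
        List<Entry> bucketOne = read(manifest, bucketFilter(List.of(1)));
        System.out.println(bucketOne.size() + " entries read for bucket 1");
    }
}
```

In a columnar format such as ORC, a filter of this kind can let the reader skip whole row groups whose bucket statistics cannot match, which is presumably where the saving discussed in this thread comes from.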
Does the query performance gain come mainly from pushing down the bucket field for the ORC manifest file?
More than half of the performance improvement comes from ORC pushdown on the manifest. Part of the remainder comes from optimizing OrcFileFormat creation, and the rest from caching some time-consuming object operations in Scan. @Aitozi
entryType,
fileFormat.createReaderFactory(entryType),
fileFormat.createReaderFactory(entryType, filters),
If we enable the reader filter and the manifest cache, will we miss data from other buckets when reading data from bucket-x? Previously, data was stored in ObjectCache after passing through the loadFilter, but now it must pass through this filter first.
If ObjectCache is enabled and withBuckets is pushed down, the problem you describe can indeed occur. I originally planned to add a filter condition to ObjectCache, but that change was too complicated and I did not have the time, so for now I only pushed down withBuckets. In most scenarios this causes no problem: as far as I have seen, Flink and Spark never call withBuckets at all, and in an OLAP query that reads the corresponding buckets by segment, the bucket-to-segment mapping stays fixed, so no problem arises there either.
If it were not for this consideration, I would suggest pushing down the partition as well.
If you feel the risk is too great, you can turn off the manifest's metadata cache entirely, and the performance will still improve significantly. @Aitozi
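The hazard discussed here can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not Paimon's ObjectCache: the class, its two maps, and the method names are hypothetical, and manifest entries are reduced to their bucket ids. The cache is keyed by file name only, so whatever filter was active on the first read decides what later reads can see.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: a toy cache keyed by file name, not Paimon's ObjectCache.
class FilteredCacheHazardSketch {

    // Hypothetical storage: manifest file name -> bucket id of each entry in that file.
    private final Map<String, List<Integer>> storage = new HashMap<>();
    // Cache keyed by file name only; it does not record which filter produced the value.
    private final Map<String, List<Integer>> cache = new HashMap<>();

    void write(String file, List<Integer> entryBuckets) {
        storage.put(file, entryBuckets);
    }

    private static List<Integer> keep(List<Integer> entries, Set<Integer> buckets) {
        List<Integer> kept = new ArrayList<>();
        for (Integer bucket : entries) {
            if (buckets.contains(bucket)) {
                kept.add(bucket);
            }
        }
        return kept;
    }

    /** The filter runs inside the "reader", so the cache stores an already-filtered list. */
    List<Integer> readWithPushDown(String file, Set<Integer> buckets) {
        List<Integer> cached = cache.computeIfAbsent(file, f -> keep(storage.get(f), buckets));
        return keep(cached, buckets);
    }

    public static void main(String[] args) {
        FilteredCacheHazardSketch scan = new FilteredCacheHazardSketch();
        scan.write("manifest-0", List.of(0, 0, 1, 2));
        System.out.println("bucket 0: " + scan.readWithPushDown("manifest-0", Set.of(0)));
        // The file holds an entry for bucket 1, but the cached value only contains bucket 0.
        System.out.println("bucket 1: " + scan.readWithPushDown("manifest-0", Set.of(1)));
    }
}
```

The second call returns an empty list even though the file contains an entry for bucket 1. If each node only ever asks for the same buckets, as in the OLAP case described above, the stale subset is never observed.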
Thanks for your explanation. If we cannot handle the pushdown correctly when the cache is enabled, I think we can disable the filter pushdown whenever the cache is enabled.
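One way to read this suggestion is as a small guard in front of the reader. The following is a sketch under assumptions, not code from this PR: bucketsToPushDown and its parameters are hypothetical, and an empty set stands for "no reader-side filter".

```java
import java.util.Collections;
import java.util.Set;

// Sketch only: hypothetical helper, not code from this PR.
class PushDownGuardSketch {

    /**
     * Decides which buckets to push into the format reader. With the cache on, nothing is
     * pushed down, so the cached value holds every entry and bucket filtering has to
     * happen after the cache lookup instead.
     */
    static Set<Integer> bucketsToPushDown(boolean cacheEnabled, Set<Integer> requestedBuckets) {
        if (cacheEnabled || requestedBuckets == null) {
            return Collections.emptySet(); // empty set means "no reader-side filter" in this sketch
        }
        return requestedBuckets;
    }

    public static void main(String[] args) {
        System.out.println(bucketsToPushDown(true, Set.of(3)));  // cache on: nothing pushed down
        System.out.println(bucketsToPushDown(false, Set.of(3))); // cache off: bucket 3 pushed down
    }
}
```

The trade-off is that the first read of each manifest file pays the full decoding cost when the cache is on, in exchange for cached values that are valid for any bucket.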
Between metadata caching and manifest pushdown, I recommend choosing the latter. Paimon's ObjectCache implementation performs very poorly; in testing it is sometimes not even as fast as manifest pushdown. I will submit a PR later to fix the ObjectCache performance problem.
@Aitozi This scenario is quite different from mainstream applications in the community. The author's internal analysis engine has no central node, so each computing node has to plan for itself, and each computing node only cares about its own bucket.
Actually, this is more like a manifest cache in the writer node than the current design.
@JingsongLi In the writer node, it may still need to read entries for more than one bucket from the manifest if the parallelism is lower than the number of buckets.
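A small worked example of this point, under a loudly stated assumption: the bucket % parallelism rule below is purely illustrative and is not claimed to be how Paimon or any engine assigns buckets to writer subtasks. It only shows that with fewer subtasks than buckets, at least one subtask must own several buckets.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the modulo assignment is an assumption made for illustration.
class BucketAssignmentSketch {

    /** Lists the buckets owned by one subtask under a hypothetical bucket % parallelism rule. */
    static List<Integer> bucketsFor(int subtask, int parallelism, int numBuckets) {
        List<Integer> owned = new ArrayList<>();
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            if (bucket % parallelism == subtask) {
                owned.add(bucket);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        // 8 buckets spread over 3 subtasks: every subtask owns more than one bucket.
        for (int subtask = 0; subtask < 3; subtask++) {
            System.out.println("subtask " + subtask + " -> " + bucketsFor(subtask, 3, 8));
        }
    }
}
```

A per-node filter therefore has to accept a set of buckets rather than a single bucket id.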
@Aitozi That is true; there are problems in this PR's implementation.
@ranxianglei Thanks for your work. I'm happy to see an effort to improve manifest file reading performance, and I left two comments.
Hi @ranxianglei, the purpose of PRs and reviews is not to achieve great accomplishments within a single PR, but to produce higher-quality code and better architecture.
I've been quite busy lately. I'll split the PR when I'm done. @JingsongLi
00db1f6 to c7a3776
Purpose
Optimize manifest reading performance and format object creation performance. In actual tests, the total time spent on the manifest dropped below 3 ms (there is still room to reduce it below 1 ms). With the metadata cache enabled, ORC pushdown enabled, and the metadata format changed to ORC, this can serve high-concurrency (QPS above 10,000), low-latency (overall RT below 50 ms) scenarios.
Linked issue: close #xxx
Tests
API and Format
Documentation