GH-45045: [C++][Parquet] Add a benchmark for size_statistics_level #45085

Open
wants to merge 8 commits into
base: main

wgtmac
Member

@wgtmac wgtmac commented Dec 20, 2024

Rationale for this change

Add a benchmark to measure the write performance of the different size statistics levels.

What changes are included in this PR?

Add a size_stats_benchmark for parquet.

Are these changes tested?

No

Are there any user-facing changes?

No

@pitrou
Member

pitrou commented Dec 20, 2024

The slowdown on lists is a bit surprising, is it because of levels histograms?

Does the non-list case have nulls?

@wgtmac
Member Author

wgtmac commented Dec 20, 2024

> The slowdown on lists is a bit surprising, is it because of levels histograms?

Which slowdown do you mean?

  1. T -> List[T] at the same level.
  2. Level::None -> Level::ColumnChunk for List[T].

For 1, I think it is due to the explosion of element sizes.
For 2, I think it is due to the levels histograms, because the string and integer types seem to show a similar regression.

The data size is large enough that most benchmarks run only 1 iteration, which may skew the comparison.

> Does the non-list case have nulls?

Yes, the null probability is hard-coded to 50%.
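
To make the "levels histograms" cost concrete: in the Parquet format, size statistics include a histogram of definition levels and one of repetition levels, so every level written has to be counted. The sketch below only illustrates that counting step. It is not the Arrow implementation, and `LevelHistogram` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count how often each level value occurs.
// histogram[i] is the number of entries whose level equals i.
// `max_level` is the column's maximum level, e.g. 1 for a nullable flat column.
std::vector<int64_t> LevelHistogram(const std::vector<int16_t>& levels,
                                    int16_t max_level) {
  assert(max_level >= 0);
  std::vector<int64_t> histogram(static_cast<std::size_t>(max_level) + 1, 0);
  for (int16_t level : levels) {
    assert(level >= 0 && level <= max_level);
    ++histogram[static_cast<std::size_t>(level)];
  }
  return histogram;
}
```

For a nullable flat column with a 50% null probability, the two buckets (level 0 for nulls, level 1 for present values) should end up roughly equal. A list column carries both repetition and definition levels with a larger maximum level, so there is more to count per leaf value.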

@mapleFU
Member

mapleFU commented Dec 20, 2024

BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, Compression::ZSTD>                         602237730 ns    602045500 ns            2 bytes_per_second=134.957Mi/s items_per_second=17.4169M/s output_size=46.1176M page_index_size=1.011k
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, Compression::ZSTD>                  542915083 ns    542902000 ns            1 bytes_per_second=149.659Mi/s items_per_second=19.3143M/s output_size=46.1177M page_index_size=1.011k
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, Compression::ZSTD>           520517541 ns    520418000 ns            1 bytes_per_second=156.124Mi/s items_per_second=20.1487M/s output_size=46.1181M page_index_size=1.417k

The benchmark must be unstable, since the None level is slower... 🤔

@wgtmac
Member Author

wgtmac commented Dec 20, 2024

Yes, the iteration count is 1 in most cases, which makes the results unstable.

@pitrou
Member

pitrou commented Dec 20, 2024

Ok, some observations:

  1. Can you arrange for the List benchmarks to be shorter? There are still not enough iterations in these cases IMHO; kBenchmarkSize should probably denote the number of leaf values, not the number of top-level rows.
  2. Are the ZSTD benchmarks useful? They just add compression overhead; we don't expect any significant insight about page index write speed there.
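
A minimal sketch of the first point, assuming fixed-length lists with no null or empty lists: treat `kBenchmarkSize` as a budget of leaf values and derive the row count from it, so that List[T] and T benchmarks write a comparable number of values. `NumRowsForLeafBudget` and the list length of 8 are hypothetical and only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Budget of leaf values shared by the flat and list benchmarks.
constexpr int64_t kBenchmarkSize = 1024 * 1024;

// Number of top-level rows needed so that a List[T] column with
// `list_length` elements per row holds about `num_leaf_values` leaves.
// Assumes every list is non-null and has exactly `list_length` elements.
int64_t NumRowsForLeafBudget(int64_t num_leaf_values, int64_t list_length) {
  assert(num_leaf_values > 0 && list_length > 0);
  return num_leaf_values / list_length;
}
```

With a list length of 8, for instance, `NumRowsForLeafBudget(kBenchmarkSize, 8)` yields 131072 rows instead of 1048576, which keeps each iteration short enough for the benchmark to run more of them.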

@wgtmac
Member Author

wgtmac commented Dec 20, 2024

> Are the ZSTD benchmarks useful? They just add compression overhead; we don't expect any significant insight about page index write speed there.

I wanted to observe the page index size overhead relative to the file size. It turns out to be trivial, so compression can be removed.

@wgtmac
Member Author

wgtmac commented Dec 21, 2024

Number of leaf values: 1024 * 1024

------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                 201108976 ns    201109929 ns           14 bytes_per_second=40.4008Mi/s items_per_second=5.21394M/s output_size=4.6232M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>          211061192 ns    210837077 ns           13 bytes_per_second=38.5369Mi/s items_per_second=4.97339M/s output_size=4.62321M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>   205393718 ns    205320000 ns           13 bytes_per_second=39.5724Mi/s items_per_second=5.10703M/s output_size=4.62325M page_index_size=139
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                214885667 ns    214848091 ns           11 bytes_per_second=42.47Mi/s items_per_second=4.88055M/s output_size=7.50526M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>         226451371 ns    224535909 ns           11 bytes_per_second=40.6376Mi/s items_per_second=4.66997M/s output_size=7.50528M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>  227898682 ns    227739545 ns           11 bytes_per_second=40.066Mi/s items_per_second=4.60428M/s output_size=7.50537M page_index_size=229
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                      206181004 ns    206177800 ns           10 bytes_per_second=41.4084Mi/s items_per_second=5.08579M/s output_size=4.89438M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               238991846 ns    238962700 ns           10 bytes_per_second=35.7273Mi/s items_per_second=4.38803M/s output_size=4.89441M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        217312880 ns    217313444 ns            9 bytes_per_second=39.2866Mi/s items_per_second=4.82518M/s output_size=4.89448M page_index_size=172
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                     229697459 ns    229696222 ns            9 bytes_per_second=41.5205Mi/s items_per_second=4.56506M/s output_size=7.77632M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              236574281 ns    236410750 ns            8 bytes_per_second=40.3413Mi/s items_per_second=4.4354M/s output_size=7.77635M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       233096755 ns    233096375 ns            8 bytes_per_second=40.9149Mi/s items_per_second=4.49847M/s output_size=7.77649M page_index_size=280

It is counter-intuitive that PageAndColumnChunk is faster than ColumnChunk for Int64 and List[Int64].

@pitrou
Member

pitrou commented Dec 21, 2024

Thanks. By the way, can you disable dictionary encoding? Does it use PLAIN encoding?

@wgtmac
Member Author

wgtmac commented Dec 23, 2024

After disabling dictionary encoding, the results look more reasonable.

------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                  98489675 ns     98475500 ns           10 bytes_per_second=82.5078Mi/s items_per_second=10.6481M/s output_size=4.32785M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>          113836396 ns    113687200 ns           10 bytes_per_second=71.468Mi/s items_per_second=9.22334M/s output_size=4.32787M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>   116892034 ns    116822900 ns           10 bytes_per_second=69.5497Mi/s items_per_second=8.97577M/s output_size=4.32791M page_index_size=139
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                210575609 ns    210545846 ns           13 bytes_per_second=43.3378Mi/s items_per_second=4.98027M/s output_size=7.47408M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>         211015493 ns    211015417 ns           12 bytes_per_second=43.2414Mi/s items_per_second=4.96919M/s output_size=7.4741M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>  211377809 ns    211377250 ns           12 bytes_per_second=43.1674Mi/s items_per_second=4.96069M/s output_size=7.47419M page_index_size=229
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                      148771942 ns    148768100 ns           10 bytes_per_second=57.388Mi/s items_per_second=7.04839M/s output_size=4.59879M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               214763756 ns    214764231 ns           13 bytes_per_second=39.7529Mi/s items_per_second=4.88245M/s output_size=4.59881M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        217214279 ns    217205077 ns           13 bytes_per_second=39.3062Mi/s items_per_second=4.82759M/s output_size=4.59889M page_index_size=172
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                     220088534 ns    220089000 ns           10 bytes_per_second=43.333Mi/s items_per_second=4.76433M/s output_size=7.74504M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              228813218 ns    228813111 ns            9 bytes_per_second=41.6808Mi/s items_per_second=4.58267M/s output_size=7.74507M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       231646792 ns    231642333 ns            9 bytes_per_second=41.1717Mi/s items_per_second=4.5267M/s output_size=7.74521M page_index_size=279
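
To read the table above as relative overhead rather than raw times, the CPU column can be compared against the `None` baseline of the same benchmark. The helper below is a small sketch of that arithmetic; `OverheadPercent` is a hypothetical name and not part of the benchmark.

```cpp
#include <cassert>

// Relative CPU-time overhead of `with_stats_ns` over `baseline_ns`, in percent.
double OverheadPercent(double baseline_ns, double with_stats_ns) {
  assert(baseline_ns > 0);
  return (with_stats_ns - baseline_ns) / baseline_ns * 100.0;
}
```

Applied to the CPU times above, `None` -> `ColumnChunk` costs roughly 15% for the Int64 primitive column (98475500 ns vs 113687200 ns) and roughly 44% for List[Int64] (148768100 ns vs 214764231 ns), while the String cases stay within a few percent. With only about 10 iterations per benchmark, these figures are indicative rather than precise.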
