GH-45045: [C++][Parquet] Add a benchmark for size_statistics_level #45085

wgtmac · 2024-12-20T06:22:13Z

Rationale for this change

Add a benchmark to measure write performance at the different size statistics levels.

What changes are included in this PR?

Add a size_stats_benchmark for parquet.

Are these changes tested?

No

Are there any user-facing changes?

No

GitHub Issue: [C++][Parquet] Add a benchmark for size_statistics_level #45045

pitrou · 2024-12-20T09:01:51Z

The slowdown on lists is a bit surprising, is it because of levels histograms?

Does the non-list case have nulls?

wgtmac · 2024-12-20T09:16:23Z

> The slowdown on lists is a bit surprising, is it because of levels histograms?

Which slowdown do you mean?

1. T -> List[T] at the same level.
2. Level::None -> Level::ColumnChunk for List[T].

For 1, I think it is due to the explosion of element sizes.
For 2, I think it is due to the levels histograms, because the string and integer types seem to show a similar regression.

The data size is large enough that most benchmarks run only 1 iteration, which may skew the comparison.

> Does the non-list case have nulls?

Yes, the null probability is hard-coded to 50%.

mapleFU · 2024-12-20T09:29:08Z

BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, Compression::ZSTD>                         602237730 ns    602045500 ns            2 bytes_per_second=134.957Mi/s items_per_second=17.4169M/s output_size=46.1176M page_index_size=1.011k
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, Compression::ZSTD>                  542915083 ns    542902000 ns            1 bytes_per_second=149.659Mi/s items_per_second=19.3143M/s output_size=46.1177M page_index_size=1.011k
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, Compression::ZSTD>           520517541 ns    520418000 ns            1 bytes_per_second=156.124Mi/s items_per_second=20.1487M/s output_size=46.1181M page_index_size=1.417k

The results must be unstable, since SizeStatisticsLevel::None comes out slowest here...🤔

wgtmac · 2024-12-20T09:31:00Z

Yes, most cases run only 1 iteration, which is not stable.

wgtmac · 2024-12-20T14:37:12Z

I have changed kBenchmarkSize from 10_000_000 to 1_000_000 and below is the result:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, Compression::UNCOMPRESSED>                 199734637 ns    199733786 ns           14 bytes_per_second=40.6791Mi/s items_per_second=5.24987M/s output_size=4.6232M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, Compression::UNCOMPRESSED>          204370481 ns    204321538 ns           13 bytes_per_second=39.7658Mi/s items_per_second=5.13199M/s output_size=4.62321M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, Compression::UNCOMPRESSED>   209760009 ns    209680923 ns           13 bytes_per_second=38.7494Mi/s items_per_second=5.00082M/s output_size=4.62325M page_index_size=139
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType, Compression::UNCOMPRESSED>                231955434 ns    231875167 ns           12 bytes_per_second=39.3514Mi/s items_per_second=4.52216M/s output_size=7.50526M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType, Compression::UNCOMPRESSED>         237242318 ns    236886091 ns           11 bytes_per_second=38.519Mi/s items_per_second=4.4265M/s output_size=7.50528M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType, Compression::UNCOMPRESSED>  223389735 ns    223350000 ns           11 bytes_per_second=40.8534Mi/s items_per_second=4.69477M/s output_size=7.50537M page_index_size=229
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, Compression::UNCOMPRESSED>                      365880153 ns    365687333 ns            3 bytes_per_second=122.363Mi/s items_per_second=2.86741M/s output_size=43.9176M page_index_size=894
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, Compression::UNCOMPRESSED>               443278611 ns    442877333 ns            3 bytes_per_second=101.036Mi/s items_per_second=2.36764M/s output_size=43.9176M page_index_size=894
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, Compression::UNCOMPRESSED>        363161333 ns    363023000 ns            2 bytes_per_second=123.261Mi/s items_per_second=2.88846M/s output_size=43.9182M page_index_size=1.54k
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType, Compression::UNCOMPRESSED>                     519121292 ns    519106000 ns            2 bytes_per_second=143.037Mi/s items_per_second=2.01997M/s output_size=74.8029M page_index_size=1.544k
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType, Compression::UNCOMPRESSED>              584476896 ns    584031000 ns            2 bytes_per_second=127.136Mi/s items_per_second=1.79541M/s output_size=74.803M page_index_size=1.544k
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType, Compression::UNCOMPRESSED>       591756459 ns    591446500 ns            2 bytes_per_second=125.542Mi/s items_per_second=1.7729M/s output_size=74.8042M page_index_size=2.588k
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, Compression::ZSTD>                         182079277 ns    181948818 ns           11 bytes_per_second=44.6554Mi/s items_per_second=5.76303M/s output_size=4.61017M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, Compression::ZSTD>                  204989156 ns    204887417 ns           12 bytes_per_second=39.6559Mi/s items_per_second=5.11782M/s output_size=4.61019M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, Compression::ZSTD>           211684645 ns    211571000 ns           12 bytes_per_second=38.4032Mi/s items_per_second=4.95614M/s output_size=4.61023M page_index_size=139
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType, Compression::ZSTD>                        246236464 ns    246142625 ns            8 bytes_per_second=37.0704Mi/s items_per_second=4.26003M/s output_size=3.51222M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType, Compression::ZSTD>                 253982437 ns    253829125 ns            8 bytes_per_second=35.9478Mi/s items_per_second=4.13103M/s output_size=3.51224M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType, Compression::ZSTD>          236436298 ns    236416571 ns            7 bytes_per_second=38.5955Mi/s items_per_second=4.43529M/s output_size=3.51233M page_index_size=229
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, Compression::ZSTD>                              409220236 ns    409182000 ns            3 bytes_per_second=109.357Mi/s items_per_second=2.56262M/s output_size=43.3514M page_index_size=894
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, Compression::ZSTD>                       385995105 ns    385996500 ns            2 bytes_per_second=115.925Mi/s items_per_second=2.71654M/s output_size=43.3514M page_index_size=894
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, Compression::ZSTD>                423469083 ns    422697500 ns            2 bytes_per_second=105.86Mi/s items_per_second=2.48068M/s output_size=43.352M page_index_size=1.54k
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType, Compression::ZSTD>                             691726084 ns    691472000 ns            1 bytes_per_second=107.381Mi/s items_per_second=1.51644M/s output_size=32.9562M page_index_size=1.544k
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType, Compression::ZSTD>                      759628875 ns    759026000 ns            1 bytes_per_second=97.8244Mi/s items_per_second=1.38148M/s output_size=32.9563M page_index_size=1.544k
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType, Compression::ZSTD>               829352332 ns    801603000 ns            1 bytes_per_second=92.6285Mi/s items_per_second=1.3081M/s output_size=32.9575M page_index_size=2.588k

pitrou · 2024-12-20T15:08:20Z

Ok, some observations:

- Can you arrange for the List benchmarks to be shorter? There are still not enough iterations in these cases IMHO; kBenchmarkSize should probably denote the number of leaf values, not the number of top-level rows.
- Are the ZSTD benchmarks useful? They just add compression overhead; we don't expect any significant insight about page index write speed there.

wgtmac · 2024-12-20T15:16:16Z

> are the ZSTD benchmarks useful? they are just adding the compression overhead, we don't expect any significant insight about page index write speed there

I wanted to observe the page index size overhead relative to the file size. It now looks trivial, so the compression variants can be removed.

wgtmac · 2024-12-21T06:45:50Z

Number of leaf values: 1024 * 1024

------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                 201108976 ns    201109929 ns           14 bytes_per_second=40.4008Mi/s items_per_second=5.21394M/s output_size=4.6232M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>          211061192 ns    210837077 ns           13 bytes_per_second=38.5369Mi/s items_per_second=4.97339M/s output_size=4.62321M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>   205393718 ns    205320000 ns           13 bytes_per_second=39.5724Mi/s items_per_second=5.10703M/s output_size=4.62325M page_index_size=139
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                214885667 ns    214848091 ns           11 bytes_per_second=42.47Mi/s items_per_second=4.88055M/s output_size=7.50526M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>         226451371 ns    224535909 ns           11 bytes_per_second=40.6376Mi/s items_per_second=4.66997M/s output_size=7.50528M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>  227898682 ns    227739545 ns           11 bytes_per_second=40.066Mi/s items_per_second=4.60428M/s output_size=7.50537M page_index_size=229
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                      206181004 ns    206177800 ns           10 bytes_per_second=41.4084Mi/s items_per_second=5.08579M/s output_size=4.89438M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               238991846 ns    238962700 ns           10 bytes_per_second=35.7273Mi/s items_per_second=4.38803M/s output_size=4.89441M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        217312880 ns    217313444 ns            9 bytes_per_second=39.2866Mi/s items_per_second=4.82518M/s output_size=4.89448M page_index_size=172
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                     229697459 ns    229696222 ns            9 bytes_per_second=41.5205Mi/s items_per_second=4.56506M/s output_size=7.77632M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              236574281 ns    236410750 ns            8 bytes_per_second=40.3413Mi/s items_per_second=4.4354M/s output_size=7.77635M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       233096755 ns    233096375 ns            8 bytes_per_second=40.9149Mi/s items_per_second=4.49847M/s output_size=7.77649M page_index_size=280

It is counter-intuitive that PageAndColumnChunk is faster than ColumnChunk for Int64 and List[Int64]

pitrou · 2024-12-21T18:24:05Z

Thanks. By the way, can you disable dictionary encoding? Does it use PLAIN encoding?

wgtmac · 2024-12-23T02:08:08Z

After disabling dictionary, the result looks more reasonable.

------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                  98489675 ns     98475500 ns           10 bytes_per_second=82.5078Mi/s items_per_second=10.6481M/s output_size=4.32785M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>          113836396 ns    113687200 ns           10 bytes_per_second=71.468Mi/s items_per_second=9.22334M/s output_size=4.32787M page_index_size=99
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>   116892034 ns    116822900 ns           10 bytes_per_second=69.5497Mi/s items_per_second=8.97577M/s output_size=4.32791M page_index_size=139
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                210575609 ns    210545846 ns           13 bytes_per_second=43.3378Mi/s items_per_second=4.98027M/s output_size=7.47408M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>         211015493 ns    211015417 ns           12 bytes_per_second=43.2414Mi/s items_per_second=4.96919M/s output_size=7.4741M page_index_size=162
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>  211377809 ns    211377250 ns           12 bytes_per_second=43.1674Mi/s items_per_second=4.96069M/s output_size=7.47419M page_index_size=229
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                      148771942 ns    148768100 ns           10 bytes_per_second=57.388Mi/s items_per_second=7.04839M/s output_size=4.59879M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               214763756 ns    214764231 ns           13 bytes_per_second=39.7529Mi/s items_per_second=4.88245M/s output_size=4.59881M page_index_size=99
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        217214279 ns    217205077 ns           13 bytes_per_second=39.3062Mi/s items_per_second=4.82759M/s output_size=4.59889M page_index_size=172
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                     220088534 ns    220089000 ns           10 bytes_per_second=43.333Mi/s items_per_second=4.76433M/s output_size=7.74504M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              228813218 ns    228813111 ns            9 bytes_per_second=41.6808Mi/s items_per_second=4.58267M/s output_size=7.74507M page_index_size=162
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       231646792 ns    231642333 ns            9 bytes_per_second=41.1717Mi/s items_per_second=4.5267M/s output_size=7.74521M page_index_size=279

wgtmac · 2025-01-08T05:34:34Z

Do you want me to change anything? @pitrou

From the result, I don't think we should enable size stats by default. cc @emkornfield

emkornfield · 2025-01-08T05:48:01Z

Thanks for the benchmarks. Am I reading this correctly: there is about a 30% regression for plain-encoded integers, but otherwise it does not add substantial overhead? If something like lz4 raw is used for compression, do we still see such a large regression for integers? (I'd assume most people use at least some form of lightweight compression.)
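
For reference, the relative change can be worked out from the dictionary-disabled results posted on 2024-12-23. The sketch below is only an illustration: the helper names are hypothetical, and the inputs are the Time and bytes_per_second values for SizeStatisticsLevel::None versus PageAndColumnChunk on Int64, copied from that table.

```cpp
#include <cassert>
#include <cstdio>

// Percentage increase in wall time relative to a baseline run.
// Hypothetical helper for illustration; not part of the benchmark.
inline double SlowdownPercent(double baseline_ns, double candidate_ns) {
  assert(baseline_ns > 0.0);
  return (candidate_ns / baseline_ns - 1.0) * 100.0;
}

// Percentage drop in throughput relative to a baseline run.
inline double ThroughputDropPercent(double baseline, double candidate) {
  assert(baseline > 0.0);
  return (1.0 - candidate / baseline) * 100.0;
}

inline void PrintInt64Regressions() {
  // Primitive Int64: None vs PageAndColumnChunk (Time in ns, Mi/s).
  std::printf("primitive: +%.1f%% time, -%.1f%% throughput\n",
              SlowdownPercent(98489675.0, 116892034.0),
              ThroughputDropPercent(82.5078, 69.5497));
  // List[Int64]: None vs PageAndColumnChunk (Time in ns, Mi/s).
  std::printf("list:      +%.1f%% time, -%.1f%% throughput\n",
              SlowdownPercent(148771942.0, 217214279.0),
              ThroughputDropPercent(57.388, 39.3062));
}
```

With those inputs, the throughput drop comes out near 16% for primitive Int64 and near 32% for List[Int64], so the "about 30%" figure corresponds to the list case.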

wgtmac · 2025-01-08T06:05:34Z

@emkornfield I did an experiment with zstd before: #45085 (comment)

pitrou · 2025-01-08T08:41:47Z

cpp/src/parquet/arrow/size_stats_benchmark.cc

+
+// This should result in multiple pages for most primitive types
+constexpr int64_t kBenchmarkSize = 1024 * 1024;
+constexpr double kNullProbability = 0.5;


This is probably a worst case for definition levels encoding. Perhaps 0.9 or 0.95 would exhibit the size statistics overhead even more?
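
One way to see why 0.5 is close to a worst case: if nulls are independent with probability p, the expected number of runs of identical definition levels among n values is 1 + 2(n - 1)p(1 - p), which peaks at p = 0.5. The sketch below compares 0.5, 0.9 and 0.95. It assumes independent nulls and an encoder that benefits from long runs (as run-length style encodings do); `ExpectedRuns` is a hypothetical helper, not part of the benchmark.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Expected number of runs of identical values in n independent
// Bernoulli(p) draws: the first value starts a run, and each of the
// remaining n - 1 positions starts a new run with probability 2p(1 - p).
// Hypothetical helper for illustration only.
inline double ExpectedRuns(int64_t n, double p) {
  assert(n > 0 && p >= 0.0 && p <= 1.0);
  return 1.0 + 2.0 * static_cast<double>(n - 1) * p * (1.0 - p);
}

// Print the expected run count and average run length for a few
// null probabilities.
inline void PrintExpectedRuns(int64_t n) {
  for (double p : {0.5, 0.9, 0.95}) {
    const double runs = ExpectedRuns(n, p);
    std::printf("p=%.2f expected runs=%.1f avg run length=%.2f\n", p, runs,
                static_cast<double>(n) / runs);
  }
}
```

For n = 1024 * 1024 this gives roughly 524k expected runs at p = 0.5 against roughly 100k at p = 0.95, so level encoding should take a smaller share of the write time at the skewed setting.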

kNullProbability = 0.95

------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                  61819253 ns     61819750 ns           20 bytes_per_second=131.43Mi/s items_per_second=16.9618M/s output_size=546.091k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>           55140742 ns     54834400 ns           10 bytes_per_second=148.173Mi/s items_per_second=19.1226M/s output_size=546.107k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>    53924121 ns     53887400 ns           10 bytes_per_second=150.777Mi/s items_per_second=19.4586M/s output_size=546.121k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                 51773791 ns     51774500 ns           10 bytes_per_second=89.4236Mi/s items_per_second=20.2527M/s output_size=864.083k page_index_size=30
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>          65489058 ns     65488500 ns           10 bytes_per_second=70.6973Mi/s items_per_second=16.0116M/s output_size=864.103k page_index_size=30
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>   65241288 ns     65241300 ns           10 bytes_per_second=70.9652Mi/s items_per_second=16.0723M/s output_size=864.122k page_index_size=44
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                       72174783 ns     72174900 ns           10 bytes_per_second=118.289Mi/s items_per_second=14.5283M/s output_size=625.915k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               102759675 ns    102760300 ns           10 bytes_per_second=83.0817Mi/s items_per_second=10.2041M/s output_size=625.937k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        105034546 ns    105034000 ns           10 bytes_per_second=81.2832Mi/s items_per_second=9.98321M/s output_size=625.957k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                      92049333 ns     92049200 ns           10 bytes_per_second=54.779Mi/s items_per_second=11.3915M/s output_size=944.123k page_index_size=31
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              122477704 ns    122478100 ns           10 bytes_per_second=41.1695Mi/s items_per_second=8.56133M/s output_size=944.149k page_index_size=31
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       121217775 ns    121217000 ns           10 bytes_per_second=41.5978Mi/s items_per_second=8.6504M/s output_size=944.174k page_index_size=51

pitrou · 2025-01-08T08:51:05Z

Sorry for the delay @wgtmac . I posted a small suggestion but this looks good to me.

Later, it would be good to investigate the cause of the overhead. It is surprising that there is a larger overhead for Int64 than for String. The numbers you posted in #45085 (comment) let me compute the following overheads (in ns/item):

	Primitive	List
Int64	17.5	65
String	0.8	11
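
The figures in this table can be reproduced, to within rounding, from the results posted on 2024-12-23 (dictionary disabled). The sketch below assumes the overhead is the difference in the Time column between SizeStatisticsLevel::PageAndColumnChunk and SizeStatisticsLevel::None, divided by the 1024 * 1024 leaf values written per iteration. That reading is an assumption, and `OverheadNsPerItem` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Leaf values written per iteration (kBenchmarkSize in the PR).
constexpr int64_t kNumItems = 1024 * 1024;

// Extra time per item when size statistics are enabled, in nanoseconds.
// Hypothetical helper for illustration; not part of the benchmark.
inline double OverheadNsPerItem(double baseline_ns, double with_stats_ns,
                                int64_t num_items = kNumItems) {
  assert(num_items > 0);
  return (with_stats_ns - baseline_ns) / static_cast<double>(num_items);
}

inline void PrintOverheads() {
  // Each call takes the "Time" value for None, then the one for
  // PageAndColumnChunk, copied from the 2024-12-23 results.
  std::printf("Int64  primitive: %.2f ns/item\n",
              OverheadNsPerItem(98489675.0, 116892034.0));
  std::printf("Int64  list:      %.2f ns/item\n",
              OverheadNsPerItem(148771942.0, 217214279.0));
  std::printf("String primitive: %.2f ns/item\n",
              OverheadNsPerItem(210575609.0, 211377809.0));
  std::printf("String list:      %.2f ns/item\n",
              OverheadNsPerItem(220088534.0, 231646792.0));
}
```

Dividing by the item count rather than comparing percentages makes the Int64 and String cases directly comparable, since their baseline write times differ.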

pitrou · 2025-01-08T09:03:38Z

Oh, wait, the benchmark is broken. Will push a fix to get more reasonable numbers.

pitrou · 2025-01-08T09:09:36Z

Ok, I now get these numbers locally:

------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                   8213595 ns      8209893 ns           84 bytes_per_second=989.66Mi/s items_per_second=127.721M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>            9757681 ns      9753633 ns           72 bytes_per_second=833.023Mi/s items_per_second=107.506M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>     9768503 ns      9764686 ns           72 bytes_per_second=832.08Mi/s items_per_second=107.385M/s output_size=537.502k page_index_size=47

BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                 10229574 ns     10226100 ns           68 bytes_per_second=451.728Mi/s items_per_second=102.539M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>          11847455 ns     11843439 ns           55 bytes_per_second=390.04Mi/s items_per_second=88.5364M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>   11808794 ns     11804771 ns           59 bytes_per_second=391.318Mi/s items_per_second=88.8265M/s output_size=848.344k page_index_size=48

BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                       13141463 ns     13130383 ns           53 bytes_per_second=650.21Mi/s items_per_second=79.8588M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>                16485608 ns     16472720 ns           43 bytes_per_second=518.281Mi/s items_per_second=63.6553M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>         16315535 ns     16302010 ns           43 bytes_per_second=523.709Mi/s items_per_second=64.3219M/s output_size=617.506k page_index_size=54

BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                      15384749 ns     15370516 ns           46 bytes_per_second=327.375Mi/s items_per_second=68.22M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>               19044082 ns     19027424 ns           36 bytes_per_second=264.456Mi/s items_per_second=55.1087M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>        18860937 ns     18843797 ns           37 bytes_per_second=267.033Mi/s items_per_second=55.6457M/s output_size=927.377k page_index_size=55

pitrou · 2025-01-08T09:13:52Z

And now the size statistics overhead in ns/item is more consistent (and smaller!):

	Primitive	List
Int64	1.48	3.04
String	1.50	3.22

wgtmac · 2025-01-08T09:17:37Z

What about the null probability? 0.5 or 0.95?

pitrou · 2025-01-08T09:18:56Z

My latest push makes it 0.95. Can you check the changes look ok to you?

wgtmac · 2025-01-08T09:23:56Z

Yes, it looks good. Thanks!

pitrou

LGTM, will merge. Thanks @wgtmac !

…stics (#45202)

### Rationale for this change

We found out in #45085 that there is a non-trivial overhead when writing size statistics is enabled.

### What changes are included in this PR?

Dramatically reduce overhead by speeding up def/rep levels histogram updates.

Performance results on the author's machine:
```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                   8103053 ns      8098569 ns           86 bytes_per_second=1003.26Mi/s items_per_second=129.477M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>            8153499 ns      8148492 ns           86 bytes_per_second=997.117Mi/s items_per_second=128.683M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>     8212560 ns      8207754 ns           83 bytes_per_second=989.918Mi/s items_per_second=127.754M/s output_size=537.502k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                 10405020 ns     10400775 ns           67 bytes_per_second=444.142Mi/s items_per_second=100.817M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>          10464784 ns     10460778 ns           66 bytes_per_second=441.594Mi/s items_per_second=100.239M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>   10469832 ns     10465739 ns           67 bytes_per_second=441.385Mi/s items_per_second=100.191M/s output_size=848.344k page_index_size=48
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                       13004962 ns     12992678 ns           52 bytes_per_second=657.101Mi/s items_per_second=80.7052M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>                13718352 ns     13705599 ns           50 bytes_per_second=622.921Mi/s items_per_second=76.5071M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>         13845553 ns     13832138 ns           52 bytes_per_second=617.222Mi/s items_per_second=75.8072M/s output_size=617.506k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                      15715263 ns     15702707 ns           44 bytes_per_second=320.449Mi/s items_per_second=66.7768M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>               16507328 ns     16493800 ns           43 bytes_per_second=305.079Mi/s items_per_second=63.5739M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>        16575359 ns     16561311 ns           42 bytes_per_second=303.836Mi/s items_per_second=63.3148M/s output_size=927.377k page_index_size=55
```

Performance results without this PR:
```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                   8042576 ns      8037678 ns           87 bytes_per_second=1010.86Mi/s items_per_second=130.458M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>            9576627 ns      9571279 ns           73 bytes_per_second=848.894Mi/s items_per_second=109.554M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>     9570204 ns      9563595 ns           73 bytes_per_second=849.576Mi/s items_per_second=109.642M/s output_size=537.502k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                 10165397 ns     10160868 ns           69 bytes_per_second=454.628Mi/s items_per_second=103.197M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>          11662568 ns     11657396 ns           60 bytes_per_second=396.265Mi/s items_per_second=89.9494M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>   11657135 ns     11653063 ns           60 bytes_per_second=396.412Mi/s items_per_second=89.9829M/s output_size=848.344k page_index_size=48
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                       13182006 ns     13168704 ns           51 bytes_per_second=648.318Mi/s items_per_second=79.6264M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>                16438205 ns     16421762 ns           43 bytes_per_second=519.89Mi/s items_per_second=63.8528M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>         16424615 ns     16409032 ns           42 bytes_per_second=520.293Mi/s items_per_second=63.9024M/s output_size=617.506k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                      15387808 ns     15373086 ns           46 bytes_per_second=327.32Mi/s items_per_second=68.2086M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>               18319628 ns     18302938 ns           37 bytes_per_second=274.924Mi/s items_per_second=57.29M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>        18346665 ns     18329336 ns           37 bytes_per_second=274.528Mi/s items_per_second=57.2075M/s output_size=927.377k page_index_size=55
```

### Are these changes tested?

Tested by existing tests, validated by existing benchmarks.

### Are there any user-facing changes?

No.

* GitHub Issue: #45201

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

conbench-apache-arrow · 2025-01-15T03:22:26Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0804ba6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Dec 20, 2024

pitrou reviewed Jan 8, 2025

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 8, 2025

wgtmac added 8 commits January 8, 2025 10:08

GH-45045: [C++][Parquet] Add a benchmark for size_statistics_level

8d47eaa

add header

112c237

add cast

3ad4539

fix lint

01c4e7e

add codec and page index size

51be91f

fix windows lint

0cd52e2

make leaf values consistent

ee8acf0

disable dict and enforce plain

253e93b

pitrou added 2 commits January 8, 2025 10:08

Fix benchmark

ced4dfd

Use skewed null prob

c34501e

pitrou force-pushed the size_stats_bench branch from 59c73a9 to c34501e Compare January 8, 2025 09:08

pitrou approved these changes Jan 8, 2025

pitrou merged commit 0804ba6 into apache:main Jan 8, 2025
34 of 35 checks passed

pitrou removed the awaiting committer review Awaiting committer review label Jan 8, 2025

This was referenced Jan 8, 2025

[C++][Parquet] Add a benchmark for size_statistics_level #45045

Closed

[C++][Parquet] Improve performance of writing size statistics #45201

Closed

GH-45201: [C++][Parquet] Improve performance of generating size statistics #45202

Merged

github-actions bot added the awaiting committer review Awaiting committer review label Jan 8, 2025
