
Record stats on write #2105
Merged: 1 commit, Jan 20, 2025
Conversation

willdealtry (Collaborator) commented Jan 7, 2025:

Calculate min/max and unique count on write, roundtrip into storage for queries and as a first-pass for adaptive encodings

@willdealtry willdealtry force-pushed the record_stats_on_write branch from b2bd8d5 to e254f1d Compare January 8, 2025 18:05
@willdealtry willdealtry changed the title from "WIP Record stats on write" to "Record stats on write" Jan 10, 2025
@willdealtry willdealtry force-pushed the record_stats_on_write branch from 2856226 to 237d120 Compare January 10, 2025 17:02
@willdealtry willdealtry marked this pull request as ready for review January 10, 2025 17:03
@willdealtry willdealtry force-pushed the record_stats_on_write branch 2 times, most recently from b8b65b2 to c78348c Compare January 14, 2025 08:13
@@ -512,6 +512,86 @@ TEST(Segment, RoundtripTimeseriesDescriptorWriteToBufferV2) {
ASSERT_EQ(decoded, copy);
}

TEST(Segment, RoundtripStatisticsV1) {
Collaborator:

Do you think that there's an argument for only saving the new stats in the v2 encoding? And leaving v1 frozen as much as we can?

willdealtry (Author):

I thought about that (not least because it would be a chunk less work), but it's only a field in the protobuf header, and it's opt-in (off by default). In a use case where you write once and read many times it could make a massive difference, so I didn't want to force people to switch encodings in order to get the benefit.

max_ = base.max_;
unique_count_ = base.unique_count_;
unique_count_precision_ = base.unique_count_precision_;
set_ = base.set_;
poodlewars (Collaborator) commented Jan 15, 2025:

I don't get the value of a bitset here over separate booleans for each statistic, given that each statistic has its own API methods anyway, but not a big deal

willdealtry (Author):

Just trying to keep things as small as possible for the binary encoding. I agree it doesn't make much difference in V1.

stats.compose<RawType>(local_stats);
} else if constexpr (is_dynamic_string_type(TagType::DataTypeTag::data_type)) {
auto local_stats = generate_string_statistics(std::span{ptr, count});
stats.compose<RawType>(local_stats);
Collaborator:

The idea is that the string pool is shared across the whole column, so we can just consider offsets into it?

willdealtry (Author):

Yes. One of the performance advantages we'll get: at the moment, when we want the unique string list for a segment (the precursor to various kinds of processing), we don't know how many unique items there are. With stats, we can add exactly that many offsets and then return, which in the low-cardinality case (like ['BID', 'ASK']) will make a huge difference.

Collaborator:

Tests where the column has multiple blocks?

Collaborator:

This is just a refactor right? No logical changes intended?

willdealtry (Author):

Apart from adding the calls to calculate the stats, it's just breaking up that function that had become really huge and difficult to follow.

poodlewars (Collaborator) left a comment:

I can't see the HyperLogLog implementation here, perhaps best to drop the enum value for it until it exists?

willdealtry (Author):

> I can't see the HyperLogLog implementation here, perhaps best to drop the enum value for it until it exists?

I think it's the only other sensible way to do it, and as well as providing an implementation, I'm trying to define a general data format. I can remove it if it bothers you.

@willdealtry willdealtry force-pushed the record_stats_on_write branch from c78348c to ea584ac Compare January 16, 2025 16:08
@willdealtry willdealtry force-pushed the record_stats_on_write branch from ea584ac to 4f81838 Compare January 17, 2025 12:24
@willdealtry willdealtry merged commit f92a5ac into master Jan 20, 2025
151 of 152 checks passed
@willdealtry willdealtry deleted the record_stats_on_write branch January 20, 2025 16:01