
Record stats on write #2105
Merged: 1 commit, Jan 20, 2025
Conversation

willdealtry (Collaborator) commented Jan 7, 2025:

Calculate min/max and unique count on write, roundtrip into storage for queries and as a first-pass for adaptive encodings

@willdealtry willdealtry force-pushed the record_stats_on_write branch from b2bd8d5 to e254f1d Compare January 8, 2025 18:05
@willdealtry willdealtry changed the title from "WIP Record stats on write" to "Record stats on write" Jan 10, 2025
@willdealtry willdealtry force-pushed the record_stats_on_write branch from 2856226 to 237d120 Compare January 10, 2025 17:02
@willdealtry willdealtry marked this pull request as ready for review January 10, 2025 17:03
@willdealtry willdealtry force-pushed the record_stats_on_write branch 2 times, most recently from b8b65b2 to c78348c Compare January 14, 2025 08:13
@@ -512,6 +512,86 @@ TEST(Segment, RoundtripTimeseriesDescriptorWriteToBufferV2) {
ASSERT_EQ(decoded, copy);
}

TEST(Segment, RoundtripStatisticsV1) {
Collaborator:

Do you think that there's an argument for only saving the new stats in the v2 encoding? And leaving v1 frozen as much as we can?

willdealtry (Author):

I thought about that (not least because it would be a chunk less work), but it's only a field in the protobuf header, and it's opt-in (off by default). In a use case where you write once and read many times it could make a massive difference, so I didn't want to force people to switch encodings in order to get the benefit.

max_ = base.max_;
unique_count_ = base.unique_count_;
unique_count_precision_ = base.unique_count_precision_;
set_ = base.set_;
poodlewars (Collaborator) commented Jan 15, 2025:

I don't get the value of a bitset here over separate booleans for each statistic, given that each statistic has its own API methods anyway, but not a big deal

willdealtry (Author):

Just trying to keep things as small as possible for the binary encoding. I agree it doesn't make much difference in V1.

stats.compose<RawType>(local_stats);
} else if constexpr (is_dynamic_string_type(TagType::DataTypeTag::data_type)) {
auto local_stats = generate_string_statistics(std::span{ptr, count});
stats.compose<RawType>(local_stats);
Collaborator:

The idea is that the string pool is shared across the whole column, so we can just consider offsets into it?

willdealtry (Author):

Yes. One of the performance advantages we'll get: at the moment, when we want the unique string list for a segment (the precursor to various kinds of processing), we don't know how many unique items there are. With stats, we can add exactly that many offsets and then return, which in the low-cardinality case (like ['BID', 'ASK']) will make a huge difference.

Collaborator:

Tests where the column has multiple blocks?

Collaborator:

This is just a refactor right? No logical changes intended?

willdealtry (Author):

Apart from adding the calls to calculate the stats, it's just breaking up that function that had become really huge and difficult to follow.

poodlewars (Collaborator) left a comment:

I can't see the HyperLogLog implementation here, perhaps best to drop the enum value for it until it exists?

willdealtry (Author):

> I can't see the HyperLogLog implementation here, perhaps best to drop the enum value for it until it exists?

I think it's the only other sensible way to do it, and as well as providing an implementation, I'm trying to define a general data format. I can remove it if it bothers you.

@willdealtry willdealtry force-pushed the record_stats_on_write branch from c78348c to ea584ac Compare January 16, 2025 16:08
@willdealtry willdealtry force-pushed the record_stats_on_write branch from ea584ac to 4f81838 Compare January 17, 2025 12:24
@willdealtry willdealtry merged commit f92a5ac into master Jan 20, 2025
151 of 152 checks passed
@willdealtry willdealtry deleted the record_stats_on_write branch January 20, 2025 16:01