Record stats on write #2105
Conversation
@@ -512,6 +512,86 @@ TEST(Segment, RoundtripTimeseriesDescriptorWriteToBufferV2) {
    ASSERT_EQ(decoded, copy);
}

TEST(Segment, RoundtripStatisticsV1) {
Do you think that there's an argument for only saving the new stats in the v2 encoding? And leaving v1 frozen as much as we can?
I thought about that (not least since it would be a chunk less work), but it's only a field in the protobuf header, and it's opt-in (off by default). When you have a use case where you write once and read many times, it's potentially going to make a massive difference, so I didn't want to force people to switch encodings in order to get the benefit of it.
max_ = base.max_;
unique_count_ = base.unique_count_;
unique_count_precision_ = base.unique_count_precision_;
set_ = base.set_;
I don't get the value of a bitset here over separate booleans for each statistic, given that each statistic has its own API methods anyway, but not a big deal
Just trying to keep things as small as possible for the binary encoding - I agree it doesn't make a lot of difference in V1
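A minimal sketch of the design point being discussed: one bit per statistic packed into a bitset, so presence flags encode to a single byte rather than one boolean per field. The class and member names here are assumptions for illustration, not the PR's actual API.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: which statistics are populated is tracked in a
// bitset (mirroring the set_ member above) instead of separate booleans.
enum class StatField : size_t { Min = 0, Max = 1, UniqueCount = 2 };

class FieldStats {
public:
    void set_max(int64_t v) {
        max_ = v;
        set_.set(static_cast<size_t>(StatField::Max));
    }
    bool has_max() const { return set_.test(static_cast<size_t>(StatField::Max)); }
    int64_t max() const { return max_; }

private:
    int64_t max_ = 0;
    std::bitset<8> set_;  // one presence bit per statistic; one byte on disk
};
```

The space saving per field is small, but for a binary encoding written once per segment it keeps the header compact.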
    stats.compose<RawType>(local_stats);
} else if constexpr (is_dynamic_string_type(TagType::DataTypeTag::data_type)) {
    auto local_stats = generate_string_statistics(std::span{ptr, count});
    stats.compose<RawType>(local_stats);
The idea is that the string pool is shared across the whole column, so we can just consider offsets into it?
Yes. One of the performance advantages we'll be able to get: at the moment, when we want a unique string list for a segment (which is the precursor to various kinds of processing), we don't know how many unique items there are. With stats we'll be able to just add that many offsets and then return, which in the low-cardinality case (like ['BID', 'ASK']) will make a huge difference.
Tests where the column has multiple blocks?
This is just a refactor right? No logical changes intended?
Apart from adding the calls to calculate the stats, it's just breaking up a function that had become really huge and difficult to follow.
I can't see the HyperLogLog implementation here; perhaps best to drop the enum value for it until it exists?
I think it's the only other sensible way to do it, and as well as providing an implementation I'm trying to define a general data format. Can remove if it bothers you.
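To make the data-format argument concrete, here is a hedged sketch of what a unique count with a precision tag might look like: the same header field can carry either an exact count or, later, a HyperLogLog estimate. Names and enum values are assumptions, not the PR's actual definitions.

```cpp
#include <cstdint>

// Sketch: the unique count is tagged with how it was computed, so a future
// approximate (HyperLogLog) implementation can reuse the same on-disk field
// without a format change. Values here are illustrative only.
enum class UniqueCountPrecision : uint8_t {
    Exact = 0,        // counted by scanning distinct values
    HyperLogLog = 1,  // approximate; no implementation exists yet
};

struct UniqueCountStat {
    uint64_t count = 0;
    UniqueCountPrecision precision = UniqueCountPrecision::Exact;
};
```

Readers that only understand exact counts can check the precision tag and ignore (or refuse) estimates, which is what makes reserving the enum value ahead of time reasonably safe.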
Calculate min/max and unique count on write, roundtrip into storage for queries and as a first-pass for adaptive encodings