Calculate size based flush threshold per topic #14765
base: master
Conversation
Can you elaborate on why we want to track size info separately for each topic? If the data is ingested into the same table, it should have the same schema and, in most cases, a similar data distribution. I'd guess most users still want to track a per-table threshold.
If this is indeed needed, can we make it configurable and add a config flag to enable it?
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@             Coverage Diff              @@
##             master   #14765      +/-   ##
============================================
+ Coverage     61.75%   63.67%    +1.92%
- Complexity      207     1483     +1276
============================================
  Files          2436     2716      +280
  Lines        133233   152231    +18998
  Branches      20636    23530     +2894
============================================
+ Hits          82274    96937    +14663
- Misses        44911    47993     +3082
- Partials       6048     7301     +1253
For our multi-stream ingestion use case, data rarely has the same schema or a similar data distribution; for us, topics are logically separate datasets. The current implementation led to many index build failures (e.g. forward index size limits, too many MV values). Beyond build failures, we also saw wild swings in segment sizes as many segments from one topic flushed, and then many segments from another topic flushed.

A concrete example from observability: we have service A and service B emitting to topics A and B, both ingesting into a single table. Generally they do not emit the same shape or volume of logs, which makes table-level segment size estimates inaccurate. Even if they did emit the same shape/size, deployments that change the log fingerprints cannot always happen in sync, so if the threshold were computed at the table level and service A was upgraded before B, there would be a period where segment build failures are likely. Another case: we turn on debug-level logging temporarily for A, and the same issues happen.
What's the downside of leaving this per topic by default? If we make the above assumption, that data should have the same schema and a similar data distribution, then computing the threshold per topic should not be measurably different from computing it per table. If the assumption doesn't hold, then we get better behavior by default. I feel that an extra config flag adds complexity without much benefit.
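For illustration, the per-topic computation being discussed can be sketched roughly as follows. This is a minimal sketch with hypothetical class and method names, not the actual Pinot implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-topic flush threshold tracking; the names
// here are illustrative and do not match Pinot's actual classes.
public class PerTopicFlushThresholdTracker {
  // One bytes-per-row estimate per topic instead of a single table-level
  // value, so topics with different row shapes converge independently.
  private final Map<String, Double> _topicToBytesPerRow = new ConcurrentHashMap<>();

  // Record the observed bytes-per-row for a completed segment of a topic.
  public void update(String topicName, long segmentSizeBytes, long numRows) {
    double bytesPerRow = (double) segmentSizeBytes / numRows;
    // Smooth with an exponential moving average so one outlier segment
    // does not swing the threshold.
    _topicToBytesPerRow.merge(topicName, bytesPerRow,
        (prev, cur) -> 0.9 * prev + 0.1 * cur);
  }

  // Derive the row threshold for the next segment of this topic.
  public long computeRowsThreshold(String topicName, long desiredSegmentSizeBytes,
      long defaultRowsThreshold) {
    Double bytesPerRow = _topicToBytesPerRow.get(topicName);
    if (bytesPerRow == null || bytesPerRow <= 0) {
      return defaultRowsThreshold; // no history yet for this topic
    }
    return Math.max(1L, (long) (desiredSegmentSizeBytes / bytesPerRow));
  }
}
```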
In most cases there is only one topic per table (single-stream ingestion), so I don't want to make the size based threshold updater per topic by default. It would also break backward compatibility for metrics.
@Jackie-Jiang Do you still have concerns here? This PR looks good to me, even without extensive testing in production.
Currently it breaks existing metrics, which is a severe backward compatibility issue.
We do want a knob to switch between per-topic and per-table tracking for backward compatibility. If you really want to track each topic when there are multiple streams, you can make that the default when the value is not explicitly set.
The main concern is the metric name. We cannot rely on metric system behavior; for the single-stream case, we should keep the behavior the same as before.
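A minimal sketch of the knob and default being suggested here, assuming a hypothetical config key (not an actual Pinot config):

```java
import java.util.Map;

// Hypothetical config knob; the key name is illustrative only.
public final class FlushThresholdConfig {
  private static final String PER_TOPIC_KEY = "realtime.segment.flush.threshold.perTopic";

  public static boolean usePerTopicThreshold(Map<String, String> streamConfig, int numStreams) {
    String explicit = streamConfig.get(PER_TOPIC_KEY);
    if (explicit != null) {
      return Boolean.parseBoolean(explicit); // explicit setting always wins
    }
    // Default: per-topic only for multi-stream tables, so single-stream
    // tables keep the existing per-table behavior and metric names.
    return numStreams > 1;
  }
}
```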
I've updated the metrics to emit a backwards compatible set, PTAL. For a controller metric, I feel it's cleaner to emit another gauge than to maintain an additional config that decides which metric to emit. It also makes migrating any alerting between metrics simpler.
With multi-stream ingestion added, it makes sense to calculate segment flush thresholds per topic rather than at the table level. This patch has been running internally; behavior was validated by checking logs for the calculated thresholds.
The metrics should be backwards compatible (at least for our metrics system they are; I'm not familiar with all of them). For example, the gauge essentially changes from
numRowsThreshold.tableName
to
numRowsThreshold.tableName.topicName
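In other words, both the legacy table-level gauge and the new per-topic gauge are emitted side by side. A minimal sketch of that idea, where setGauge() is a stand-in for the real controller metrics API, not Pinot's actual method:

```java
// Hypothetical sketch of backwards compatible gauge emission; GaugeSink
// and setGauge() are placeholders for the actual metrics interface.
public class ThresholdMetrics {
  interface GaugeSink {
    void setGauge(String name, long value);
  }

  private final GaugeSink _sink;

  public ThresholdMetrics(GaugeSink sink) {
    _sink = sink;
  }

  public void emit(String tableName, String topicName, long rowsThreshold) {
    // Keep the legacy table-level gauge so existing dashboards and
    // alerts continue to work unchanged.
    _sink.setGauge("numRowsThreshold." + tableName, rowsThreshold);
    // Additionally emit the new per-topic gauge.
    _sink.setGauge("numRowsThreshold." + tableName + "." + topicName, rowsThreshold);
  }
}
```

Emitting both gauges avoids a config that switches metric names, at the cost of one extra time series per table, and lets alerting migrate to the per-topic gauge gradually.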