[improvement] Some performance related changes to evaluate #1426
Summary
Based on some CPU and memory analysis we ran on our production systems, we noticed a few things that could be improved.
I'd like to present these changes and discuss whether it makes sense to merge them into the project (otherwise we'll likely keep them on a local fork). Some of the changes may still need a bit of work, or could be done another way. All feedback welcome!
Allow String column provider to take the name, so that it can use different settings
For `String` column types it's possible to define a `ColStrProvider`, but it can only be defined once, globally. While a `ColStrProvider` is great for pre-sizing buffers when creating batches, not every column needs the same amount of space: some may be limited to a few characters, while others (e.g. log bodies) can run to hundreds of characters.
Adding the name of the column as input to the `ColStrProvider` allows the internal logic to size buffers differently per column, or to apply other logic. This change does not affect users of the API.
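A minimal sketch of how a name-aware provider could look. The `func(name string) ColStr` signature is the proposed change; `WithColStrProviderByName`, the per-column capacity map, and the simplified `ColStr` are made up for illustration:

```go
package column

// ColStr wraps the byte buffer used when encoding a String column
// into a batch (simplified for this sketch).
type ColStr struct {
	buf []byte
}

// ColStrProvider now receives the column name, so buffer sizing can
// differ per column. (Proposed signature; the current provider takes
// no arguments.)
type ColStrProvider func(name string) ColStr

var colStrProvider ColStrProvider

// WithColStrProviderByName is a hypothetical helper: it picks a buffer
// capacity per column name and falls back to a default for the rest.
func WithColStrProviderByName(capByColumn map[string]int, defaultCap int) {
	colStrProvider = func(name string) ColStr {
		c, ok := capByColumn[name]
		if !ok {
			c = defaultCap
		}
		return ColStr{buf: make([]byte, 0, c)}
	}
}
```

A caller could then pre-size generously only where it pays off, e.g. `WithColStrProviderByName(map[string]int{"log_body": 4096}, 64)`.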
Prevent hashing enums just for checking validity
We create quite large batches, and our records contain quite a few `Enum` values. In CPU profiles we noticed a lot of time being spent in map lookups just to add an enum value to a batch. The lookup checks whether the value is valid, since enum values don't need to form a continuous range in ClickHouse.
The optimization takes a little overhead up front by checking whether the enum definition is a continuous range and, if so, capturing the lower and upper bound. When the range is continuous, validating a value only requires checking that it falls between those bounds; this simple comparison is much faster overall than the map lookup.
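A self-contained sketch of the idea (`enumValidator` and its fields are illustrative names, not the library's actual internals):

```go
package main

import (
	"fmt"
	"sort"
)

// enumValidator checks whether a raw value is a member of an Enum
// definition. For a continuous range, two comparisons replace the
// per-row map lookup.
type enumValidator struct {
	continuous bool               // true if the defined values form a continuous range
	lo, hi     int16              // bounds, only meaningful when continuous
	valid      map[int16]struct{} // fallback for non-continuous enums
}

// newEnumValidator does the one-time, up-front work: detect whether
// the defined values are continuous and capture the bounds.
func newEnumValidator(values []int16) *enumValidator {
	sorted := append([]int16(nil), values...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	v := &enumValidator{}
	continuous := len(sorted) > 0
	for i := 1; i < len(sorted); i++ {
		if sorted[i] != sorted[i-1]+1 {
			continuous = false
			break
		}
	}
	if continuous {
		v.continuous = true
		v.lo, v.hi = sorted[0], sorted[len(sorted)-1]
		return v
	}
	v.valid = make(map[int16]struct{}, len(values))
	for _, val := range values {
		v.valid[val] = struct{}{}
	}
	return v
}

// isValid runs per appended value; continuous enums skip hashing entirely.
func (v *enumValidator) isValid(val int16) bool {
	if v.continuous {
		return val >= v.lo && val <= v.hi
	}
	_, ok := v.valid[val]
	return ok
}

func main() {
	v := newEnumValidator([]int16{1, 2, 3, 4}) // continuous: bounds check only
	fmt.Println(v.isValid(3), v.isValid(9))    // true false
}
```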
Allow tuple type to prevent array creation overhead
This is probably the change with the least impact, though it could perhaps be improved. Normally, when using the `Tuple` type, values need to be inserted as an `Array`/`Slice` type. Depending on how data is represented internally before inserting, this can mean new slices have to be allocated.
So I'm wondering whether a dedicated `Tuple` type makes sense, with specific types for the most common lengths (like `Tuple2`, `Tuple3` and `Tuple4`). Usage would look something like this:
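Here is a sketch of that usage. `Tuple2`/`Tuple3` are the proposed types, not part of the current API, and `appendRow` stands in for appending a row to a batch so the example is self-contained:

```go
package main

import "fmt"

// Tuple2 and Tuple3 are hypothetical generic tuple types that carry
// their values directly, so no []any slice has to be allocated per row.
type Tuple2[A, B any] struct {
	V1 A
	V2 B
}

type Tuple3[A, B, C any] struct {
	V1 A
	V2 B
	V3 C
}

// appendRow stands in for batch.Append in this sketch.
func appendRow(cols ...any) {
	fmt.Println(cols...)
}

func main() {
	appendRow(
		"cpu_load",
		Tuple2[string, string]{V1: "host", V2: "eu-west-1"},
		Tuple3[int64, int64, float64]{V1: 1700000000, V2: 60, V3: 0.42},
	)
}
```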