statistics foundation -- MCVs and histograms for all types #929
This provides statistics collection functionality, which is integrated in https://github.com/jchappelow/kwil-db/commits/stats-engine-start with a simple periodic refresh and persistence.
This PR adds MCV and histogram components to the `ColumnStatistics` struct, and the insert/update/delete functions for both full table scans and incremental updates. This work is the foundation for computing selectivity given a column, an inequality, and a value.

Another branch that builds on this begins maintaining statistics as described in #927, namely: (1) persistence of each dataset's column statistics, and (2) periodic refresh of statistics to deal with divergence from ground truth as table contents change. This is the reason for the type serialization changes in this PR, as well as the new `EncStats` and `DecStats` functions. The blob representing a table's statistics may be stored in a table or simply written to disk, depending on the required crash recovery guarantees (atomic writes would avoid needing a state rebuild in the event of a poorly timed crash when finalizing a block).

There is considerable awkwardness in the MCV and histogram code, as there is a mix of generics and interfaces (`any` fields and values). I'm not certain this is the best way, and we might try to redefine `ColumnStatistics` as a generic itself, but that creates issues in other places. I'm primarily concerned with the mathy logic of building and maintaining the stats. I am very much not set on the current approach for dealing with all polymorphic types, so please feel free to suggest and describe a simpler approach that may have escaped me. AFAICT, when dealing with generics, you end up with quite a few `.(type)` switches in order to instantiate concrete instances of types or functions. Slices further limit the options.