
statistics foundation -- MCVs and histograms for all types #929

Closed
jchappelow wants to merge 6 commits from stats-mcvs-histo



@jchappelow jchappelow commented Aug 23, 2024

This provides statistics collection functionality, which is integrated in https://github.com/jchappelow/kwil-db/commits/stats-engine-start with a simple periodic refresh and persistence.


This PR adds MCV (most common value) and histogram components to the ColumnStatistics struct, along with the insert/update/delete functions used by both full table scans and incremental updates. This work is the foundation for computing selectivity given a column, an inequality, and a value.
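To make the selectivity idea concrete, here is a minimal sketch of estimating the fraction of rows matching `col < v` from an MCV list plus a histogram. It is not the code in this PR: the `ColumnStats` type, its fields, and `SelectivityLT` are hypothetical names, the column is fixed to int64, and the histogram is assumed to count only rows that are not already covered by the MCV list, with values spread uniformly inside each bucket.

```go
package main

import "fmt"

// ColumnStats is a hypothetical, simplified stand-in for per-column
// statistics, specialized to int64 to keep the sketch short.
type ColumnStats struct {
	RowCount int64
	MCVals   []int64 // most common values
	MCFreqs  []int64 // MCFreqs[i] is the number of rows equal to MCVals[i]
	Bounds   []int64 // ascending bucket boundaries, len(Bounds) == len(Counts)+1
	Counts   []int64 // Counts[i] is the number of non-MCV rows in [Bounds[i], Bounds[i+1])
}

// SelectivityLT estimates the fraction of rows with value < v.
func (s *ColumnStats) SelectivityLT(v int64) float64 {
	if s.RowCount == 0 {
		return 0
	}
	var rows float64
	// MCV counts are exact, so add each qualifying value's full frequency.
	for i, mcv := range s.MCVals {
		if mcv < v {
			rows += float64(s.MCFreqs[i])
		}
	}
	// Histogram buckets: whole buckets below v count fully, and the bucket
	// containing v contributes a linearly interpolated share.
	for i, cnt := range s.Counts {
		lo, hi := s.Bounds[i], s.Bounds[i+1]
		switch {
		case v >= hi:
			rows += float64(cnt)
		case v > lo:
			rows += float64(cnt) * float64(v-lo) / float64(hi-lo)
		}
	}
	return rows / float64(s.RowCount)
}

func main() {
	s := &ColumnStats{
		RowCount: 100,
		MCVals:   []int64{5},
		MCFreqs:  []int64{40},
		Bounds:   []int64{0, 10, 20},
		Counts:   []int64{30, 30},
	}
	// (40 MCV rows + 30 from the first bucket + half of the second) / 100
	fmt.Printf("%.2f\n", s.SelectivityLT(15))
}
```

Selectivity for the other inequalities follows from the same pieces, for example `col >= v` as one minus the estimate above.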

Another branch that builds on this begins maintaining statistics as described in #927, namely: (1) persistence of each dataset's column statistics, and (2) periodic refresh of statistics to deal with divergence from ground truth as table contents change. This is the reason for the type serialization changes in this PR, as well as the new EncStats and DecStats functions. The blob representing a table's statistics may be stored in a table or simply written to disk, depending on the required crash recovery guarantees (atomic writes would avoid needing a state rebuild in the event of a poorly timed crash when finalizing a block).
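For the "simply to disk" option, one common way to get an atomic write is to write the blob to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that pattern only. It does not reproduce EncStats or DecStats: `TableStats`, `encodeStats`, `decodeStats`, and `writeFileAtomic` are hypothetical names, and `encoding/gob` is used purely as a stand-in serialization format. It also assumes a POSIX-like filesystem where rename within a directory is atomic.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// TableStats is a hypothetical, reduced statistics record.
type TableStats struct {
	RowCount int64
	MCVals   []int64
	MCFreqs  []int64
}

// encodeStats serializes the stats to a blob (gob is an assumption here).
func encodeStats(s *TableStats) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeStats is the inverse of encodeStats.
func decodeStats(b []byte) (*TableStats, error) {
	var s TableStats
	if err := gob.NewDecoder(bytes.NewReader(b)).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

// writeFileAtomic writes data to a temp file next to path, syncs it, and
// renames it into place, so readers see either the old or the new blob.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".stats-*")
	if err != nil {
		return err
	}
	// Clean up the temp file on failure; after a successful rename this
	// Remove fails harmlessly because the temp name no longer exists.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	dir, err := os.MkdirTemp("", "stats-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	blob, err := encodeStats(&TableStats{RowCount: 100, MCVals: []int64{5}, MCFreqs: []int64{40}})
	if err != nil {
		panic(err)
	}
	path := filepath.Join(dir, "table.stats")
	if err := writeFileAtomic(path, blob); err != nil {
		panic(err)
	}
	raw, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	s, err := decodeStats(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.RowCount, s.MCVals, s.MCFreqs)
}
```

Storing the blob in a table instead would tie it to the same transaction as the block's other writes, which is the stronger guarantee mentioned above.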

There is considerable awkwardness in the MCV and histogram code, as there is a mix of generics and interfaces (any fields and values). I'm not certain this is the best way, and we might try to redefine ColumnStatistics as a generic type itself, but that creates issues in other places. I'm primarily concerned with the mathy logic of building and maintaining the stats. I am very much not set on the current approach for dealing with all polymorphic types, so please feel free to suggest and describe a simpler approach that may have escaped me. AFAICT, when dealing with generics, you end up with quite a few .(type) switches in order to instantiate concrete instances of types or functions. Slices further limit the options.
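As an illustration of the awkwardness described here (not code from this PR), the sketch below pairs a generic helper with an untyped entry point. Because the column values arrive as `any`, a `.(type)` switch is needed to pick the concrete instantiation, and each slice type needs its own case since a `[]int64` cannot be treated as a `[]any`. The names `ordered`, `insertSorted`, and `insertAny` are hypothetical, and only three element types are handled.

```go
package main

import (
	"fmt"
	"sort"
)

// ordered is a local constraint covering the element types in this sketch.
type ordered interface {
	~int64 | ~float64 | ~string
}

// insertSorted inserts v into the ascending slice vals, keeping it sorted.
func insertSorted[T ordered](vals []T, v T) []T {
	i := sort.Search(len(vals), func(i int) bool { return vals[i] >= v })
	vals = append(vals, v)     // grow by one element
	copy(vals[i+1:], vals[i:]) // shift the tail right by one
	vals[i] = v
	return vals
}

// insertAny is the untyped entry point. The type switch exists only to
// select which instantiation of insertSorted to call.
func insertAny(vals any, v any) (any, error) {
	switch vs := vals.(type) {
	case []int64:
		x, ok := v.(int64)
		if !ok {
			return nil, fmt.Errorf("value of type %T does not fit a []int64 column", v)
		}
		return insertSorted(vs, x), nil
	case []float64:
		x, ok := v.(float64)
		if !ok {
			return nil, fmt.Errorf("value of type %T does not fit a []float64 column", v)
		}
		return insertSorted(vs, x), nil
	case []string:
		x, ok := v.(string)
		if !ok {
			return nil, fmt.Errorf("value of type %T does not fit a []string column", v)
		}
		return insertSorted(vs, x), nil
	default:
		return nil, fmt.Errorf("unsupported column type %T", vals)
	}
}

func main() {
	out, err := insertAny([]int64{1, 3, 5}, int64(4))
	fmt.Println(out, err)

	_, err = insertAny([]int64{1, 3, 5}, "four")
	fmt.Println(err)
}
```

Making the statistics struct itself generic would remove the inner switch, but any caller that only knows the column type at runtime would still need one switch at the boundary.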

@jchappelow jchappelow force-pushed the stats-mcvs-histo branch 8 times, most recently from c1dab29 to 6feac79 Compare August 27, 2024 13:38
@jchappelow
Member Author

The integration branch that uses/proves this code is now broken, either due to something on main or something that I changed since Friday. Reluctantly making this ready for review anyway since I don't expect large changes to fix it, and it may not even be in this code.

@jchappelow jchappelow marked this pull request as ready for review August 27, 2024 13:40
@jchappelow
Member Author

Some of these tests are atrocious and there is UUID vs *UUID breakage in multiple places. :( fixing

@jchappelow jchappelow force-pushed the stats-mcvs-histo branch 2 times, most recently from a15b78c to 7c58dc8 Compare September 3, 2024 14:27
@jchappelow jchappelow added this to the v0.10 milestone Sep 16, 2024
@jchappelow
Member Author

There are a few cherries to pick out of this, but overall we don't need the changes now. Closing.

@jchappelow jchappelow closed this Oct 9, 2024