
pg: advanced query feats, table inspection, and stats #901

Merged · 1 commit into kwilteam:main · Aug 12, 2024

Conversation

jchappelow (Member)

This at least partially resolves #800, and provides some foundational components for table statistics collection (#410). I have more work done on stats in both engine and pg for incremental updates, but the pieces in this PR were substantial, so I pulled them out for easier review.

This adds functionality to the pg package:

  • Table inspection: the ColumnInfo function and the ColInfo and ColType types provide a way to inspect any PostgreSQL table without prior knowledge of its schema. The (*ColInfo).ScanVal method provides an instance of a suitable type into which a column expression may be scanned by one of the new advanced query methods described below.

  • Advanced query execution:

    • the sql.QueryScanner interface is the advanced version of Execute that uses caller-provided scan values and a function to run for each scanned row (a usage sketch follows this list):

      // QueryScanner represents a type that provides the ability to execute an SQL
      // statement, where for each row:
      //
      //  1. result values are scanned into the variables in the scans slice
      //  2. the provided function is then called
      //
      // The function would typically capture the variables in the scans slice,
      // allowing it to operate on the values. For instance, append the values to
      // slices allocated by the caller, or perform reduction operations like
      // sum/mean/min/etc.
      //
      // NOTE: This method may end up being included in the Tx interface alongside
      // Executor since all of the concrete transaction implementations provided by
      // this package implement this method.
      type QueryScanner interface {
      	QueryScanFn(ctx context.Context, stmt string,
      		scans []any, fn func() error, args ...any) error
      }
    • each transaction type in the pg package satisfies the sql.QueryScanner interface

    • the pg.QueryRowFunc function executes an SQL statement, handling the rows and returned values as described by the sql.QueryScanner interface.

    • the pg.QueryRowFuncAny is similar to pg.QueryRowFunc, except that no scan values slice is provided. The provided function is called for each row of the result, and the caller does not determine the types of the Go variables in the values slice. In this way it behaves similarly to Execute, but with "for each row" semantics so that every row does not need to be loaded into memory.

  • Table statistics collection: beginning with a simplified sql.Statistics struct based on the types proposed in [WIP] Feature/sql exec cost estimate #603, the pg package provides the following new methods aimed at the (relatively expensive) collection of ground truth table statistics:

    • RowCount provides an exact row count
    • colStats computes column-wise statistics
    • TableStats uses the above functions to build a *sql.Statistics for a table.

    These methods will not be used routinely. We will have incremental updates, but there are cases where a full scan may be needed to obtain the ground-truth statistics.

  • Use the pgNumericToDecimal helper to convert from pgtypes.Numeric to either our decimal.Decimal or types.Uint256. This reuses the logic of the recent pgtype decoding added to the query helper, which interprets the values returned by row.Values() in pgx.CollectRows.
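For illustration, here is a minimal usage sketch of QueryScanFn from the caller's side; the helper, table, and column names are hypothetical, and only the QueryScanFn signature comes from this PR:

// sumBalances shows one way a caller might use a sql.QueryScanner to
// reduce a result set without holding all rows in memory. The accounts
// table and its balance column are illustrative.
func sumBalances(ctx context.Context, qs sql.QueryScanner) (int64, error) {
	var balance, total int64
	err := qs.QueryScanFn(ctx, `SELECT balance FROM accounts`,
		[]any{&balance}, // scan destination, reused for every row
		func() error {
			total += balance // runs after each row is scanned into balance
			return nil
		})
	return total, err
}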

Note that this PR has three separate commits for easier review, but they can be squashed when merged.

type QueryScanner interface {
	QueryScanFn(ctx context.Context, stmt string,
		scans []any, fn func() error, args ...any) error
}
brennanjl (Collaborator)

Overall seems fine, but what is the reason for not doing something more conventional, similar to the standard library's sql.Rows (containing Next(), Scan(), etc.)?

jchappelow (Member, Author)

> Overall seems fine, but what is the reason for not doing something more conventional, similar to the standard library's sql.Rows (containing Next(), Scan(), etc.)?

Basically, there's just no real reason to introduce a rows type. The for-each helpers in the pgx package were also a more intuitive building block for this method.
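For context, the pgx v5 building block referred to here is used roughly like this (table and column names hypothetical):

rows, _ := conn.Query(ctx, `SELECT id, name FROM users`) // pgx permits deferring this error to ForEachRow
var (
	id   int64
	name string
)
_, err := pgx.ForEachRow(rows, []any{&id, &name}, func() error {
	fmt.Println(id, name) // operate on this row's scanned values
	return nil
})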

internal/sql/pg/stats.go: outdated review threads (resolved)
var pos int
var domainName pgtype.Text // NULL indicated by the Valid bool
var colName, dataType, typeOrArray, isNullable string
var colDefault any
brennanjl (Collaborator)

This default is scanned out, but it will probably be pretty confusing to a consumer of this information. In the ColInfo struct it is of type any, so a consumer would probably expect it to be one of Kwil's types (e.g. our numeric type). However, a numeric will actually be one of pgx's.

It seems like this (or the queryRowFunc) is a place where we could rely on the *datatype struct to convert this, using the (*datatype).Decode() method.

Alternatively, making the *pg.ColInfo struct unexported could make sense. Really, what I am getting at here is that the consumer of this package is left with ambiguity as to whether things are converted to be "Kwil native" or left "pgx native".

jchappelow (Member, Author) · Jul 31, 2024

I agree that exporting this colDefault is unhelpful at best and problematic for the reasons you mentioned. I didn't give this field much thought; it was really just there for completeness, and it's not helpful outside this package without more constrained types.

I'll almost certainly unexport things rather than recast ColInfo and its fields in terms of Kwil types. The *datatype and its Decode method are designed to address a different problem: interpreting the any values from the row.Values() return in pgx.CollectRows. In this work we want to dictate the scan value's type so that it's predictable and easy to work with, with no fuss over OIDs and no ambiguity about what type pgx might choose if left to its own devices (or postgres', in the case of domains). The types from the ScanVal method are quite tightly coupled to the (unexported) colStats.

The package-level TableStats function is the only thing that really uses these or really needs to be exported. I think some of the table inspection code is useful more generally, but the fields and methods that return instances of types are not great to export.
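To make the intended pairing concrete, here is a sketch of feeding ScanVal-produced values into QueryScanFn; the schema and table names are hypothetical, and tx is assumed to be one of the pg transaction types (which satisfy sql.QueryScanner):

cols, err := pg.ColumnInfo(ctx, tx, "ds_schema", "mytable")
if err != nil {
	return err
}
scans := make([]any, len(cols))
for i := range cols {
	scans[i] = cols[i].ScanVal() // a pointer of a predictable, package-chosen type
}
err = tx.QueryScanFn(ctx, `SELECT * FROM ds_schema.mytable`, scans,
	func() error {
		// each scans[i] now holds the current row's value for column i
		return nil
	})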

func queryRowFunc(ctx context.Context, conn *pgx.Conn, sql string,
	scans []any, fn func() error, args ...any) error {
	rows, _ := conn.Query(ctx, sql, args...)
	_, err := pgx.ForEachRow(rows, scans, fn)
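	// Note: pgx documents that the error returned by Query may be ignored
	// when the rows are iterated afterward; ForEachRow closes the rows and
	// returns any deferred error via rows.Err().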
brennanjl (Collaborator)

We should add better tests for scanning to different types, because I am fairly confident they will not work. For example, I don't think Uint256Array's Scan() implementation will work, because it only checks for a []string type, and I would expect pgx to give us either []any or []pgtype.UUID.

jchappelow (Member, Author) · Jul 31, 2024

> We should add better tests for scanning to different types, because I am fairly confident they will not work. For example, I don't think Uint256Array's Scan() implementation will work, because it only checks for a []string type, and I would expect pgx to give us either []any or []pgtype.UUID.

Yeah, I will add some tests using the scanners. But an sql.Scanner will always be given a primitive type; that's pretty much the purpose of such a type. (Note the types in pgtypes are sql.Scanners themselves.) pgx is just smart enough to recognize when it's scanning into a sql.Scanner and skip past its own "smart" codecs, so afaik we are sidestepping all that complexity and dealing with lcd (lowest-common-denominator) types.

That said, I have no clue if []string is right for a Uint256Array. It's probably not. 😆

Will test this a bit.

For reference, the std lib docs expand on this:

// Scanner is an interface used by [Rows.Scan].
type Scanner interface {
	// Scan assigns a value from a database driver.
	//
	// The src value will be of one of the following types:
	//
	//    int64
	//    float64
	//    bool
	//    []byte
	//    string
	//    time.Time
	//    nil - for NULL values
	//
	// An error should be returned if the value cannot be stored
	// without loss of information.
	//
	// Reference types such as []byte are only valid until the next call to Scan
	// and should not be retained. Their underlying memory is owned by the driver.
	// If retention is necessary, copy their values before the next call to Scan.
	Scan(src any) error
}
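As an illustration of that contract, any Scan for an array type has to start from one of those primitive forms. Below is a hypothetical sketch, not the actual Uint256Array code: the type is a stand-in, and quoting/NULL elements in the array literal are ignored for brevity.

import (
	"fmt"
	"math/big"
	"strings"
)

// bigIntArray is a stand-in for a Uint256Array-like type.
type bigIntArray []*big.Int

// Scan accepts the Postgres text-format array (e.g. "{1,2,3}") as either a
// string or []byte, the primitive forms a database driver is expected to supply.
func (a *bigIntArray) Scan(src any) error {
	var s string
	switch v := src.(type) {
	case string:
		s = v
	case []byte:
		s = string(v)
	case nil: // NULL array
		*a = nil
		return nil
	default:
		return fmt.Errorf("unsupported scan source %T", src)
	}
	s = strings.Trim(s, "{}")
	if s == "" {
		*a = bigIntArray{}
		return nil
	}
	for _, elem := range strings.Split(s, ",") {
		n, ok := new(big.Int).SetString(strings.TrimSpace(elem), 10)
		if !ok {
			return fmt.Errorf("invalid element %q", elem)
		}
		*a = append(*a, n)
	}
	return nil
}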


// FieldDesc describes a result value from a query. This is used to convey
// information on the values passed to the closure given to QueryRowFuncAny.
type FieldDesc struct {
brennanjl (Collaborator)

Unexport? This gives the caller access to the OID, which, for the uint256 domain, will be different on every system. I'd hate for someone to forget this and rely on this functionality externally later.

jchappelow (Member, Author)

Ah, yes! The motivation is poorly explained, but this pertains to the main similarity between QueryRowFuncAny and the current Execute/query functions, which also use row.Values(). Essentially, QueryRowFuncAny should behave like query but do something for each row rather than loading all rows into memory.

This FieldDesc relates to the above since it is an argument to the function that is executed for each row.

So yeah, both FieldDesc and QueryRowFuncAny in their current forms should be unexported. Any exported QueryRowFuncAny should not trouble the user with OIDs and such, and should ensure the []any contains the same types of values that the exported Execute methods return (it needs to use the registered datatypes + decodeFromPG => the Decode methods).

// appropriate Go type to scan a row containing this column type in an SQL
// statement.
type ColInfo struct {
	Pos int
brennanjl (Collaborator)

As mentioned in another comment, I think we should do one of two things:

  1. Make the data type uniform with Kwil data types (*types.DataType), and make the Default a Kwil type as well (instead of a pgx type), since this is what other functions in this package return.
  2. Make this struct unexported, in which case it doesn't matter that it doesn't follow the same semantics as the rest of the package.

brennanjl (Collaborator)

We could also just get rid of Default, if we aren't relying on it.

jchappelow (Member, Author)

Commented in https://github.com/kwilteam/kwil-db/pull/901/files/a62d42727c0574985c662cd8ae94390810691ea5#r1699049173, but regarding ColInfo: as the doc comment says, this is really ingesting column descriptions from PostgreSQL, and the Type() method is intended to return the canonical enumerator of the type. That is fine since it's well defined and just an enum, but things do get problematic with the ScanVal() any method, as there's really nothing useful a consumer (outside the pg package) can do with the value, which is just a type that other functions in the pg package want/need. (You're 💯 correct to note that most of this table inspection code is designed to power other functionality internal to pg.)

jchappelow force-pushed the pg-scanval-colinfo branch 4 times, most recently from b64cc5e to 2108065 on Aug 1, 2024 at 17:16
jchappelow (Member, Author)

Getting uint256 and decimal and their array types scanning and valuing correctly was way more painful than expected.
There are some incomplete tests I will get to tonight, but this PR is mostly updated according to the review comments now.
The easier update from yesterday is that QueryRowFuncAny now converts to Kwil types like the existing query that returns all the rows. Also unexported a bunch of stuff that would not be useful outside of pg.

Comment on lines +301 to +302
for i, pgVal := range pgxVals {
	decVal, err := decodeFromPGVal(pgVal, oids[i], oidTypes)
jchappelow (Member, Author) · Aug 1, 2024

Since I moved queryRowFuncAny to a different file, I'll call out this key change, which converts the types to Kwil types according to our OID type map.
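For context, a sketch of roughly how such a loop completes; the destination slice and error handling here are paraphrased, not the exact code:

for i, pgVal := range pgxVals {
	decVal, err := decodeFromPGVal(pgVal, oids[i], oidTypes)
	if err != nil {
		return nil, err
	}
	pgxVals[i] = decVal // callers now see Kwil-native values, as with Execute
}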

// ColumnInfo attempts to describe the columns of a table in a specified
// PostgreSQL schema. The results are as reported by information_schema.columns.
//
// If the provided sql.Executor is also a ColumnInfoer, its ColumnInfo method
// will be used. This is primarily for testing with a mocked DB transaction.
// Otherwise, the Executor must be one of the transaction types created by this
// package, which provide access to the underlying DB connection.
func ColumnInfo(ctx context.Context, tx sql.Executor, schema, tbl string) ([]ColInfo, error) {
jchappelow (Member, Author) · Aug 1, 2024

Exported functions don't take any ColInfo; they only return them for general informational purposes. The docs for ColInfo elaborate. This ColumnInfo won't be needed outside of the pg package, but I find it generally useful, so I didn't go so far as to unexport it or ColInfo; it wouldn't derail any of the work in engine or elsewhere if we did.
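For reference, a minimal sketch of calling it, using only the pieces shown in this PR (Pos and the Type() enumerator); the schema and table names are hypothetical:

cols, err := pg.ColumnInfo(ctx, tx, "public", "mytable")
if err != nil {
	return err
}
for _, col := range cols {
	fmt.Println(col.Pos, col.Type()) // ordinal position and canonical type enumerator
}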

brennanjl (Collaborator) previously approved these changes on Aug 7, 2024, leaving a comment:

other than the commented code lgtm

core/types/uint256.go: outdated review thread (resolved)
This adds the ability to scan query results into provided variables
instead of relying on pgx Row.Values() to choose the type. This provides
some foundational components for table statistics collection.

The sql.QueryScanner interface is the advanced version of Execute that
uses caller-provided scan values and a function to run for each scanned
row:

// QueryScanner represents a type that provides the ability to execute an SQL
// statement, where for each row:
//
//  1. result values are scanned into the variables in the scans slice
//  2. the provided function is then called
//
// The function would typically capture the variables in the scans slice,
// allowing it to operate on the values. For instance, append the values to
// slices allocated by the caller, or perform reduction operations like
// sum/mean/min/etc.
//
// NOTE: This method may end up being included in the Tx interface alongside
// Executor since all of the concrete transaction implementations provided by
// this package implement this method.
type QueryScanner interface {
	QueryScanFn(ctx context.Context, stmt string,
		scans []any, fn func() error, args ...any) error
}

Each transaction type in the pg package satisfies the sql.QueryScanner
interface.

The pg.QueryRowFunc function executes an SQL statement, handling the rows
and returned values as described by the sql.QueryScanner interface.

The pg.QueryRowFuncAny is similar to pg.QueryRowFunc, except that no scan
values slice is provided. The provided function is called for each row
of the result. The caller does not determine the types of the Go variables
in the values slice. In this way it behaves similarly to Execute, but
providing "for each row" semantics so that every row does not need to
be loaded into memory.

Table statistics collection: beginning with a simplified sql.Statistics
struct based on the types proposed in the initial unmerged query cost
branch, the pg package provides the following new methods aimed at the
(relatively expensive) collection of ground truth table statistics:

  - RowCount provides an exact row count
  - colStats computes column-wise statistics
  - TableStats uses the above functions to build a *sql.Statistics for a table.

These methods will not be used routinely. We will have incremental
updates, but there are cases where a full scan may be needed to obtain
the ground truth statistics.

pg: decimal and uint256 use pgNumericToDecimal helper

Use the pgNumericToDecimal helper to reuse the logic to convert from
pgtypes.Numeric to either our decimal.Decimal or types.Uint256 in the
recent pgtype decoding added to the query helper for interpreting the
values returned by row.Values() in pgx.CollectRows.

types,decimal: sql scan/value for uint256 and decimal and arrays

nulls with uint256 and decimal

deps: update pgx module from 5.5.5 to 5.6.0
jchappelow merged commit 3aecf89 into kwilteam:main on Aug 12, 2024 (2 checks passed)
jchappelow deleted the pg-scanval-colinfo branch on Aug 12, 2024 at 16:15
Successfully merging this pull request may close these issues:

provide an advanced DB interface for internals

2 participants