
pg: advanced query feats, table inspection, and stats #901

Merged · 1 commit into kwilteam:main · Aug 12, 2024

Conversation

jchappelow (Member)

This at least partially resolves #800, and provides some foundational components for table statistics collection (#410). I have more work done on stats in both engine and pg for incremental updates, but the pieces in this PR were substantial, so I pulled them out for easier review.

This adds functionality to the pg package:

  • Table inspection: the ColumnInfo function and the ColInfo and ColType types provide a way to inspect any PostgreSQL table without prior knowledge of its schema. The (*ColInfo).ScanVal method provides an instance of a suitable type into which a column expression may be scanned by one of the new advanced query methods described below.

  • Advanced query execution:

    • the sql.QueryScanner interface is the advanced version of Execute that uses caller-provided scan values and a function to run for each scanned row (a usage sketch follows this list):

      // QueryScanner represents a type that provides the ability to execute an SQL
      // statement, where for each row:
      //
      //  1. result values are scanned into the variables in the scans slice
      //  2. the provided function is then called
      //
      // The function would typically capture the variables in the scans slice,
      // allowing it to operate on the values. For instance, append the values to
      // slices allocated by the caller, or perform reduction operations like
      // sum/mean/min/etc.
      //
      // NOTE: This method may end up being included in the Tx interface alongside
      // Executor since all of the concrete transaction implementations provided by
      // this package implement this method.
      type QueryScanner interface {
      	QueryScanFn(ctx context.Context, stmt string,
      		scans []any, fn func() error, args ...any) error
      }
    • each transaction type in the pg package satisfies the sql.QueryScanner interface

    • the pg.QueryRowFunc function executes an SQL statement, handling the rows and returned values as described by the sql.QueryScanner interface.

    • the pg.QueryRowFuncAny is similar to pg.QueryRowFunc, except that no scan values slice is provided. The provided function is called for each row of the result, and the caller does not determine the types of the Go variables in the values slice. In this way it behaves similarly to Execute, but with "for each row" semantics so that every row does not need to be loaded into memory.

  • Table statistics collection: beginning with a simplified sql.Statistics struct based on the types proposed in [WIP] Feature/sql exec cost estimate #603, the pg package provides the following new methods aimed at the (relatively expensive) collection of ground truth table statistics:

    • RowCount provides an exact row count
    • colStats computes column-wise statistics
    • TableStats uses the above functions to build a *sql.Statistics for a table.

    These methods will not be used routinely. We will have incremental updates, but there are cases where a full scan may be needed to obtain the ground-truth statistics.

  • Use the pgNumericToDecimal helper to convert from pgtypes.Numeric to either our decimal.Decimal or types.Uint256. This reuses the logic of the recent pgtype decoding added to the query helper, which interprets the values returned by row.Values() in pgx.CollectRows.
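For illustration, here is a minimal usage sketch of QueryScanFn from the caller's side; the helper, table, and column names are hypothetical, and only the QueryScanFn signature comes from this PR:

// sumBalances shows one way a caller might use a sql.QueryScanner to
// reduce a result set without holding all rows in memory. The accounts
// table and its balance column are illustrative.
func sumBalances(ctx context.Context, qs sql.QueryScanner) (int64, error) {
	var balance, total int64
	err := qs.QueryScanFn(ctx, `SELECT balance FROM accounts`,
		[]any{&balance}, // scan destination, reused for every row
		func() error {
			total += balance // runs after each row is scanned into balance
			return nil
		})
	return total, err
}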

Note that this PR has three separate commits for easier review, but they can be squashed when merged.

type QueryScanner interface {
	QueryScanFn(ctx context.Context, stmt string,
		scans []any, fn func() error, args ...any) error
}
brennanjl (Collaborator)

Overall seems fine, but what is the reason for not doing something more conventional, similar to the standard library's sql.Rows (containing Next(), Scan(), etc.)?

jchappelow (Member, Author)

> Overall seems fine, but what is the reason for not doing something more conventional, similar to the standard library's sql.Rows (containing Next(), Scan(), etc.)?

Basically, there's just no real reason to introduce a rows type. The for-each helpers in the pgx package were also a more intuitive building block for this method.
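For context, the pgx v5 building block referred to here is used roughly like this (table and column names hypothetical):

rows, _ := conn.Query(ctx, `SELECT id, name FROM users`) // pgx permits deferring this error to ForEachRow
var (
	id   int64
	name string
)
_, err := pgx.ForEachRow(rows, []any{&id, &name}, func() error {
	fmt.Println(id, name) // operate on this row's scanned values
	return nil
})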

internal/sql/pg/stats.go: outdated review threads (resolved)
var pos int
var domainName pgtype.Text // NULL indicated by the Valid bool
var colName, dataType, typeOrArray, isNullable string
var colDefault any
brennanjl (Collaborator)

This default is scanned out, but it will probably be pretty confusing to a consumer of this information. In the ColInfo struct it is of type any, so a consumer would probably expect it to be one of Kwil's types (e.g. our numeric type). However, a numeric will actually be one of pgx's.

It seems like this (or the queryRowFunc) is a place where we could rely on the *datatype struct to convert this, using the (*datatype).Decode() method.

Alternatively, making the *pg.ColInfo struct unexported could make sense. Really, what I am getting at here is that the consumer of this package is left with ambiguity as to whether things are converted to be "Kwil native" or left "pgx native".

jchappelow (Member, Author) · Jul 31, 2024

I agree that exporting this colDefault is unhelpful at best and problematic for the reasons you mentioned. I didn't give this field much thought; it was really just there for completeness, and it's not helpful outside this package without more constrained types.

I'll almost certainly unexport things rather than recast ColInfo and its fields in terms of Kwil types. The *datatype and its Decode method are designed to address a different problem: interpreting the any values from the row.Values() return in pgx.CollectRows. In this work we want to dictate the scan value's type so that it's predictable and easy to work with, with no fuss over OIDs and no ambiguity about what type pgx might choose if left to its own devices (or postgres', in the case of domains). The types from the ScanVal method are quite tightly coupled to the (unexported) colStats.

The package-level TableStats function is the only thing that really uses these or really needs to be exported. I think some of the table inspection code is useful more generally, but the fields and methods that return instances of types are not great to export.
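To make the intended pairing concrete, here is a sketch of feeding ScanVal-produced values into QueryScanFn; the schema and table names are hypothetical, and tx is assumed to be one of the pg transaction types (which satisfy sql.QueryScanner):

cols, err := pg.ColumnInfo(ctx, tx, "ds_schema", "mytable")
if err != nil {
	return err
}
scans := make([]any, len(cols))
for i := range cols {
	scans[i] = cols[i].ScanVal() // a pointer of a predictable, package-chosen type
}
err = tx.QueryScanFn(ctx, `SELECT * FROM ds_schema.mytable`, scans,
	func() error {
		// each scans[i] now holds the current row's value for column i
		return nil
	})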

func queryRowFunc(ctx context.Context, conn *pgx.Conn, sql string,
	scans []any, fn func() error, args ...any) error {
	rows, _ := conn.Query(ctx, sql, args...)
	_, err := pgx.ForEachRow(rows, scans, fn)
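	// Note: pgx documents that the error returned by Query may be ignored
	// when the rows are iterated afterward; ForEachRow closes the rows and
	// returns any deferred error via rows.Err().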
brennanjl (Collaborator)

We should add better tests for scanning to different types, because I am fairly confident they will not work. For example, I don't think Uint256Array's Scan() implementation will work, because it only checks for a []string type, and I would expect pgx to give us either []any or []pgtype.UUID.

jchappelow (Member, Author) · Jul 31, 2024

> We should add better tests for scanning to different types, because I am fairly confident they will not work. For example, I don't think Uint256Array's Scan() implementation will work, because it only checks for a []string type, and I would expect pgx to give us either []any or []pgtype.UUID.

Yeah, I will add some tests using the scanners. But an sql.Scanner will always be given a primitive type; that's pretty much the purpose of such a type. (Note the types in pgtypes are sql.Scanners themselves.) pgx is just smart enough to recognize when it's scanning into a sql.Scanner and skip past its own "smart" codecs, so afaik we are sidestepping all that complexity and dealing with lcd (lowest-common-denominator) types.

That said, I have no clue if []string is right for a Uint256Array. It's probably not. 😆

Will test this a bit.

For reference, the std lib docs expand on this:

// Scanner is an interface used by [Rows.Scan].
type Scanner interface {
	// Scan assigns a value from a database driver.
	//
	// The src value will be of one of the following types:
	//
	//    int64
	//    float64
	//    bool
	//    []byte
	//    string
	//    time.Time
	//    nil - for NULL values
	//
	// An error should be returned if the value cannot be stored
	// without loss of information.
	//
	// Reference types such as []byte are only valid until the next call to Scan
	// and should not be retained. Their underlying memory is owned by the driver.
	// If retention is necessary, copy their values before the next call to Scan.
	Scan(src any) error
}
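As an illustration of that contract, any Scan for an array type has to start from one of those primitive forms. Below is a hypothetical sketch, not the actual Uint256Array code: the type is a stand-in, and quoting/NULL elements in the array literal are ignored for brevity.

import (
	"fmt"
	"math/big"
	"strings"
)

// bigIntArray is a stand-in for a Uint256Array-like type.
type bigIntArray []*big.Int

// Scan accepts the Postgres text-format array (e.g. "{1,2,3}") as either a
// string or []byte, the primitive forms a database driver is expected to supply.
func (a *bigIntArray) Scan(src any) error {
	var s string
	switch v := src.(type) {
	case string:
		s = v
	case []byte:
		s = string(v)
	case nil: // NULL array
		*a = nil
		return nil
	default:
		return fmt.Errorf("unsupported scan source %T", src)
	}
	s = strings.Trim(s, "{}")
	if s == "" {
		*a = bigIntArray{}
		return nil
	}
	for _, elem := range strings.Split(s, ",") {
		n, ok := new(big.Int).SetString(strings.TrimSpace(elem), 10)
		if !ok {
			return fmt.Errorf("invalid element %q", elem)
		}
		*a = append(*a, n)
	}
	return nil
}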


// FieldDesc describes a result value from a query. This is used to convey
// information on the values passed to the closure given to QueryRowFuncAny.
type FieldDesc struct {
brennanjl (Collaborator)

Unexport? This gives the caller access to the OID, which, for the uint256 domain, will be different on every system. I'd hate for someone to forget this and rely on this functionality externally later.

jchappelow (Member, Author)

Ah, yes! The motivation is poorly explained, but this pertains to the main similarity between QueryRowFuncAny and the current Execute/query functions, which also use row.Values(). Essentially, QueryRowFuncAny should behave like query but do something for each row rather than loading all rows into memory.

This FieldDesc relates to the above since it is an argument to the function that is executed for each row.

So yeah, both FieldDesc and QueryRowFuncAny in their current forms should be unexported. Any exported QueryRowFuncAny should not trouble the user with OIDs and such, and should ensure the []any contains the same types of values that the exported Execute methods return (it needs to use the registered datatypes + decodeFromPG => the Decode methods).

// appropriate Go type to scan a row containing this column type in an SQL
// statement.
type ColInfo struct {
	Pos int
brennanjl (Collaborator)

As mentioned in another comment, I think we should do one of two things:

  1. Make the data type uniform with Kwil data types (*types.DataType), and make the Default a Kwil type as well (instead of a pgx type), since this is what other functions in this package return.
  2. Make this struct unexported, in which case it doesn't matter that it doesn't follow the same semantics as the rest of the package.

brennanjl (Collaborator)

We could also just get rid of Default, if we aren't relying on it.

jchappelow (Member, Author)

Commented in https://github.com/kwilteam/kwil-db/pull/901/files/a62d42727c0574985c662cd8ae94390810691ea5#r1699049173, but regarding ColInfo: as the doc comment says, this is really ingesting column descriptions from PostgreSQL, and the Type() method is intended to return the canonical enumerator of the type. That is fine since it's well defined and just an enum, but things do get problematic with the ScanVal() any method, as there's really nothing useful a consumer (outside the pg package) can do with the value, which is just a type that other functions in the pg package want/need. (You're 💯 correct to note that most of this table inspection code is designed to power other functionality internal to pg.)

jchappelow force-pushed the pg-scanval-colinfo branch 4 times, most recently from b64cc5e to 2108065 on Aug 1, 2024 at 17:16
jchappelow (Member, Author)

Getting uint256 and decimal and their array types scanning and valuing correctly was way more painful than expected.
There are some incomplete tests I will get to tonight, but this PR is mostly updated according to the review comments now.
The easier update from yesterday is that QueryRowFuncAny now converts to Kwil types like the existing query that returns all the rows. Also unexported a bunch of stuff that would not be useful outside of pg.

Comment on lines +301 to +302
for i, pgVal := range pgxVals {
	decVal, err := decodeFromPGVal(pgVal, oids[i], oidTypes)
jchappelow (Member, Author) · Aug 1, 2024

Since I moved queryRowFuncAny to a different file, I'll call out this key change, which converts the types to Kwil types according to our OID type map.
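For context, a sketch of roughly how such a loop completes; the destination slice and error handling here are paraphrased, not the exact code:

for i, pgVal := range pgxVals {
	decVal, err := decodeFromPGVal(pgVal, oids[i], oidTypes)
	if err != nil {
		return nil, err
	}
	pgxVals[i] = decVal // callers now see Kwil-native values, as with Execute
}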

// ColumnInfo attempts to describe the columns of a table in a specified
// PostgreSQL schema. The results are as reported by information_schema.columns.
//
// If the provided sql.Executor is also a ColumnInfoer, its ColumnInfo method
// will be used. This is primarily for testing with a mocked DB transaction.
// Otherwise, the Executor must be one of the transaction types created by this
// package, which provide access to the underlying DB connection.
func ColumnInfo(ctx context.Context, tx sql.Executor, schema, tbl string) ([]ColInfo, error) {
jchappelow (Member, Author) · Aug 1, 2024

Exported functions don't take any ColInfo; they only return them for general informational purposes. The docs for ColInfo elaborate. This ColumnInfo won't be needed outside of the pg package, but I find it generally useful, so I didn't go so far as to unexport it or ColInfo; it wouldn't derail any of the work in engine or elsewhere if we did.
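For reference, a minimal sketch of calling it, using only the pieces shown in this PR (Pos and the Type() enumerator); the schema and table names are hypothetical:

cols, err := pg.ColumnInfo(ctx, tx, "public", "mytable")
if err != nil {
	return err
}
for _, col := range cols {
	fmt.Println(col.Pos, col.Type()) // ordinal position and canonical type enumerator
}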

brennanjl (Collaborator) previously approved these changes on Aug 7, 2024, leaving a comment:

other than the commented code lgtm

core/types/uint256.go: outdated review thread (resolved)
This adds the ability to scan query results into provided variables
instead of relying on pgx Row.Values() to choose the type. This provides
some foundational components for table statistics collection.

The sql.QueryScanner interface is the advanced version of Execute that
uses caller-provided scan values and a function to run for each scanned
row:

// QueryScanner represents a type that provides the ability to execute an SQL
// statement, where for each row:
//
//  1. result values are scanned into the variables in the scans slice
//  2. the provided function is then called
//
// The function would typically capture the variables in the scans slice,
// allowing it to operate on the values. For instance, append the values to
// slices allocated by the caller, or perform reduction operations like
// sum/mean/min/etc.
//
// NOTE: This method may end up being included in the Tx interface alongside
// Executor since all of the concrete transaction implementations provided by
// this package implement this method.
type QueryScanner interface {
	QueryScanFn(ctx context.Context, stmt string,
		scans []any, fn func() error, args ...any) error
}

Each transaction type in the pg package satisfies the sql.QueryScanner
interface.

The pg.QueryRowFunc function executes an SQL statement, handling the rows
and returned values as described by the sql.QueryScanner interface.

The pg.QueryRowFuncAny is similar to pg.QueryRowFunc, except that no scan
values slice is provided. The provided function is called for each row
of the result. The caller does not determine the types of the Go variables
in the values slice. In this way it behaves similarly to Execute, but
providing "for each row" semantics so that every row does not need to
be loaded into memory.

Table statistics collection: beginning with a simplified sql.Statistics
struct based on the types proposed in the initial unmerged query cost
branch, the pg package provides the following new methods aimed at the
(relatively expensive) collection of ground truth table statistics:

  - RowCount provides an exact row count
  - colStats computes column-wise statistics
  - TableStats uses the above functions to build a *sql.Statistics for a table.

These methods will not be used routinely. We will have incremental
updates, but there are cases where a full scan may be needed to obtain
the ground truth statistics.

pg: decimal and uint256 use pgNumericToDecimal helper

Use the pgNumericToDecimal helper to reuse the logic to convert from
pgtypes.Numeric to either our decimal.Decimal or types.Uint256 in the
recent pgtype decoding added to the query helper for interpreting the
values returned by row.Values() in pgx.CollectRows.

types,decimal: sql scan/value for uint256 and decimal and arrays

nulls with uint256 and decimal

deps: update pgx module from 5.5.5 to 5.6.0
jchappelow merged commit 3aecf89 into kwilteam:main on Aug 12, 2024 (2 checks passed)
jchappelow deleted the pg-scanval-colinfo branch on Aug 12, 2024 at 16:15
Successfully merging this pull request may close these issues:

provide an advanced DB interface for internals

2 participants