-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pg: advanced query feats, table inspection, and stats #901
Conversation
type QueryScanner interface { | ||
QueryScanFn(ctx context.Context, stmt string, | ||
scans []any, fn func() error, args ...any) error | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems find, but what is the reason for not doing something more conventional, similar to the standard library's sql.Rows
(containing Next()
, Scan()
, etc.)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems find, but what is the reason for not doing something more conventional, similar to the standard library's
sql.Rows
(containingNext()
,Scan()
, etc.)?
Basically just no real reason to introduce a rows type. The for each helpers in the pgx package were also a more intuitive building block for this method.
var pos int | ||
var domainName pgtype.Text // null in Valid bool | ||
var colName, dataType, typeOrArray, isNullable string | ||
var colDefault any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This default is scanned out, but it will probably be pretty confusing to a consumer of this information. In the ColInfo
struct, it is of type any
, however a consumer would probably think that it would be one of Kwil's types (e.g. our numeric type). However, a numeric will actually be one of pgx's.
It seems like this (or the queryRowFunc
) is a place that we could rely on the *datatype
struct to convert this, using the (*datatype.Decode())
method.
Alternatively, making the *pg.ColInfo
struct unexported could make sense. Really, what I am getting at here is that the consumer of this package is left with ambiguity as to whether things are converted to be "Kwil native", or if they are "pgx native"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that exporting this colDefault
is unhelpful at best and problematic for the reasons you mentioned. I didn't give this field much thought, really it was just for completeness but it's not helpful outside this package without more constrained types.
I'll almost certainly unexport things rather than cast ColInfo
and it's fields in terms of Kwil types. The *datatype
and it's Decode
method are designed to address a different problem -- interpreting the any
from the row.Values()
return in pgx.CollectRows
. In this work we want to dictate the scan value's type so that it's predicatable and easy to work with, no fuss with OIDs and ambiguity about what type pgx might choose if left to its own devices (or postgres' in the case of domains). The types from the ScanVal method are quite tightly coupled to the (unexported) colStats
.
The package-level TableStats
function is the only thing that really uses these or really needs to be exported. I think some of the table inspection code is useful more generally, but the fields and methods that return instances of types are not great to export.
internal/sql/pg/system.go
Outdated
func queryRowFunc(ctx context.Context, conn *pgx.Conn, sql string, | ||
scans []any, fn func() error, args ...any) error { | ||
rows, _ := conn.Query(ctx, sql, args...) | ||
_, err := pgx.ForEachRow(rows, scans, fn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add better tests for scanning to different types, because I am fairly confident they will not work. For example, I don't think Uint256Array
's Scan()
implementation will work, because it only checks for a []string
type, and I would expected pgx would give us either []any
or []pgtype.UUID
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add better tests for scanning to different types, because I am fairly confident they will not work. For example, I don't think
Uint256Array
'sScan()
implementation will work, because it only checks for a[]string
type, and I would expected pgx would give us either[]any
or[]pgtype.UUID
.
Yeah, I will add some tests using the scanners. But an sql.Scanner
will always be given a primitive type. That's pretty much the purpose of such a type. (Note the types in pgtypes
are sql.Scanner
s themselves). pgx is just smart enough to recognize when it's scanning into a sql.Scanner
and skip past it's own "smart" codecs, so afaik we are sidestepping all that complexity and dealing with lcd types.
That said, I have no clue if []string
is right for a Uint256Array
. It's probably not. 😆
Will test this a bit.
For reference, the std lib docs expand on this:
// Scanner is an interface used by [Rows.Scan].
type Scanner interface {
// Scan assigns a value from a database driver.
//
// The src value will be of one of the following types:
//
// int64
// float64
// bool
// []byte
// string
// time.Time
// nil - for NULL values
//
// An error should be returned if the value cannot be stored
// without loss of information.
//
// Reference types such as []byte are only valid until the next call to Scan
// and should not be retained. Their underlying memory is owned by the driver.
// If retention is necessary, copy their values before the next call to Scan.
Scan(src any) error
}
internal/sql/pg/system.go
Outdated
|
||
// FieldDesc describes result value from a query. This is used to convey | ||
// information on the values passed to the closure given to QueryRowFuncAny. | ||
type FieldDesc struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unexport? This gives the caller access to the OID, which for the uint256 domain, will be different on every system. Would hate for someone to forget this and rely on this functionality externally later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes! So, the motivation is poorly explained, but this pertains to the main similarity between QueryRowFuncAny
and the current Execute
/query
functions, which also use row.Values()
. Essentially QueryRowFuncAny
should behave like query
but does something for each row rather than loading all rows into memory.
This FieldDesc
relates to the above since it is an argument to the function that is executed for each row.
So yeah, both FieldDesc
and QueryRowFuncAny
in their current forms should be unexported. Any exported QueryRowFuncAny
should not trouble the user with OIDs and such, and should ensure the []any
contains the same types of values that the exported Execute
methods return. (needs to use the registered datatype
s + decodeFromPG
=> the Decode
methods)
internal/sql/pg/system.go
Outdated
// appropriate Go type to scan a row containing this column type in an SQL | ||
// statement. | ||
type ColInfo struct { | ||
Pos int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in another comment, I think we should do one of two things:
- Make the data type uniform with Kwil data types (*types.DataType), and make the
Default
a Kwil type as well (instead of a pgx type), since this is what other functions in this package return. - Make this struct unexported, in which case it doesn't matter that it doesn't follow the same semantics as the rest of the package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also just get rid of Default
, if we aren't relying on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented in https://github.com/kwilteam/kwil-db/pull/901/files/a62d42727c0574985c662cd8ae94390810691ea5#r1699049173, but regarding ColInfo
, as the doc comment says, this is really ingesting column descriptions from PostgreSQL
, and the Type()
method is intended to return the canonical enumerator of the type. This is fine as it's well defined and just an enum, but things do get problematic with the ScanVal() any
method as there's really nothing useful a consumer (outside the pg
package) can do with the value which is just a type that other functions in the pg
package want/need. (You're 💯 correct to note that most of this table inspection code is designed to power other functionality internal to pg
.)
b64cc5e
to
2108065
Compare
Getting uint256 and decimal and their array types scanning and valueing right was way more painful than expected. |
for i, pgVal := range pgxVals { | ||
decVal, err := decodeFromPGVal(pgVal, oids[i], oidTypes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I moved queryRowFuncAny
to a differnt file, I'll call out this key change, which is converting the types to Kwil types according to our oid type map.
// ColumnInfo attempts to describe the columns of a table in a specified | ||
// PostgreSQL schema. | ||
// PostgreSQL schema. The results are **as reported by information_schema.column**. | ||
// | ||
// If the provided sql.Executor is also a ColumnInfoer, its ColumnInfo method | ||
// will be used. This is primarily for testing with a mocked DB transaction. | ||
// Otherwise, the Executor must be one of the transaction types created by this | ||
// package, which provide access to the underlying DB connection. | ||
func ColumnInfo(ctx context.Context, tx sql.Executor, schema, tbl string) ([]ColInfo, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exported functions don't take any ColInfo
, only return for general information purposes. Docs for ColInfo
elaborate. This ColumnInfo
won't be need needed outside of the pg
packages, but I find it generally useful, so didn't go so far as to unexport it or ColInfo
, but it wouldn't derail any of the work in engine or elsewhere if we did.
98d0dd8
to
221b31e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other than the commented code lgtm
This adds the ability to scan query results into provided variables instead of relying on pgx Row.Values() to choose the type. This provides some foundational components for table statistics collection. The sql.QueryScanner interface is the advanced version of Execute that uses caller-provided scan values and a function to run for each scanned row: // QueryScanner represents a type that provides the ability to execute an SQL // statement, where for each row: // // 1. result values are scanned into the variables in the scans slice // 2. the provided function is then called // // The function would typically capture the variables in the scans slice, // allowing it to operator on the values. For instance, append the values to // slices allocated by the caller, or perform reduction operations like // sum/mean/min/etc. // // NOTE: This method may end up being included in the Tx interface alongside // Executor since all of the concrete transaction implementations provided by // this package implement this method. type QueryScanner interface { QueryScanFn(ctx context.Context, stmt string, scans []any, fn func() error, args ...any) error } Each transaction type in the pg package satisfies the sql.QueryScanner interface. The pg.QueryRowFunc function executes an SQL statement, handling the rows and returned values as described by the sql.QueryScanner interface. The pg.QueryRowFuncAny is similar to pg.QueryRowFunc, except that no scan values slice is provided. The provided function is called for each row of the result. The caller does not determine the types of the Go variables in the values slice. In this way it behaves similar to Execute, but providing "for each row" semantics so that every row does not need to be loaded into memory. Table statistics collection: beginning with a simplified sql.Statistics struct based on the types proposed in the initial unmerged query cost branch, the pg package provides the following new methods aimed at the (relatively expensive) collection of ground truth table statistics: - RowCount provides an exact row count - colStats computes column-wise statistics - TableStats uses the above functions to build a *sql.Statistics for a table. These methods are will not be used routinely. We will have incremental updates, but there are cases where a full scan may be needed to obtain the ground truth statistics. pg: decimal and uint256 use pgNumericToDecimal helper Use the pgNumericToDecimal helper to reuse the logic to convert from pgtypes.Numeric to either our decimal.Decimal or types.Uint256 in the recent pgtype decoding added to the query helper for interpreting the values returned by row.Values() in pgx.CollectRows. types,decimal: sql scan/value for uint256 and decimal and arrays nulls with uint256 and decimal deps: update pgx module from 5.5.5 to 5.6.0
af24b58
to
85a82e9
Compare
This at least partially resolves #800, and provides some foundational components for table statistics collection (#410). I have more work done on stats in both engine and pg for incremental updates, but the pieces in this PR were substantial, so I pulled them out for easier review.
This adds functionality to the
pg
package:Table inspection: the
ColumnInfo
function and theColInfo
andColType
types provide a way to inspect any postgresql table without knowledge of the scheme. The(*ColInfo).ScanVal
method provides an instance of a suitable type into which a column expression may be scanned in one of the new advanced query methods described below.Advanced query execution:
the
sql.QueryScanner
interface is the advanced version ofExecute
that uses caller-provided scan values and a function to run for each scanned row:each transaction type in the
pg
package satisfies thesql.QueryScanner
interfacethe
pg.QueryRowFunc
function executes an SQL statement, handling the rows and returned values as described by thesql.QueryScanner
interface.the
pg.QueryRowFuncAny
is similar topg.QueryRowFunc
, except that no scan values slice is provided. The provided function is called for each row of the result. The caller does not determine the types of the Go variables in the values slice. In this way it behaves similar toExecute
, but providing "for each row" semantics so that every row does not need to be loaded into memory.Table statistics collection: beginning with a simplified
sql.Statistics
struct based on the types proposed in [WIP] Feature/sql exec cost estimate #603, thepg
package provides the following new methods aimed at the (relatively expensive) collection of ground truth table statistics:RowCount
provides an exact row countcolStats
computes column-wise statisticsTableStats
uses the above functions to build a*sql.Statistics
for a table.These methods are will not be used routinely. We will have incremental updates, but there are cases where a full scan may be needed to obtain the ground truth statistics.
Use the
pgNumericToDecimal
helper to reuse the logic to convert frompgtypes.Numeric
to either ourdecimal.Decimal
ortypes.Uint256
in the recent pgtype decoding added to thequery
helper for interpreting the values returned byrow.Values()
inpgx.CollectRows
.Note that this PR has three separate commits for easier review, but they can be squashed when merged.