Update initial load query for old postgres to return a defined order … #31328
Conversation
Coverage report for source-postgres

source-postgres test report (commit

Step | Result
---|---
Build connector tar | ✅
Build source-postgres docker image for platform(s) linux/x86_64 | ✅
Java Connector Unit Tests | ✅
Java Connector Integration Tests | ✅
Acceptance tests | ✅
Validate metadata for source-postgres | ✅
Connector version semver check | ✅
Connector version increment check | ✅
QA checks | ✅

View the logs here
View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
airbyte-ci connectors --name=source-postgres test
LOGGER.info("Preparing query for table: {}", tableName);
final String fullTableName = getFullyQualifiedTableNameWithQuoting(schemaName, tableName,
    quoteString);
final String wrappedColumnNames = RelationalDbQueryUtils.enquoteIdentifierList(columnNames, quoteString);
final String sql =
-    "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid FROM generate_series(?, ?) as page, generate_series(?,?) as tuple))"
+    "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid tid_addr FROM generate_series(?, ?) as page, generate_series(?,?) as tuple ORDER BY tid_addr))"
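The changed line above lives inside a Java format string. A minimal sketch of how the updated query text might be assembled, assuming plain Java string formatting; the builder class, method name, and the column/table values below are illustrative, not from the Airbyte codebase — only the SQL text itself comes from the diff:

```java
// Hypothetical helper (not Airbyte code) that renders the new ordered query.
// The %%s pairs escape to %s so that Postgres FORMAT(), not Java, fills in
// each (page,tuple) pair; ORDER BY tid_addr gives the array a defined order.
public class OrderedCtidQuery {

  static String buildSql(final String wrappedColumnNames, final String fullTableName) {
    return String.format(
        "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY ("
            + "SELECT FORMAT('(%%s,%%s)', page, tuple)::tid tid_addr "
            + "FROM generate_series(?, ?) as page, generate_series(?,?) as tuple "
            + "ORDER BY tid_addr))",
        wrappedColumnNames, fullTableName);
  }

  public static void main(String[] args) {
    // Illustrative column list and table name.
    System.out.println(buildSql("\"id\", \"name\"", "\"public\".\"users\""));
  }
}
```

The four `?` placeholders would then be bound via JDBC to the page and tuple ranges of the chunk being read.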
Do we need to check for any other queries that bring in values we use in checkpointing that also need an ORDER BY?
The other ctid query is WHERE ctid > '0,0' AND ctid <= '(131000,0)'.
This >< range returns rows in sequential order; it was created as an optimization for old Postgres, so I don't think we need to sort there ourselves.
With the Postgres 12 query, Postgres applies some logic where it iterates over the bigger of the two ranges (pages) first. I couldn't find a way to control this other than sorting the array.
Other incremental queries are all or nothing, so if an error happened in the middle of an incremental xmin sync, for example, it would not checkpoint at all. So even if there is some case where records arrive out of order, we would be fine (I don't think this can happen anywhere).
My general recommendation is to always add a sort when you want data sorted.
SQL query optimizers can easily remove the sorting step when it's not needed (or choose a query plan that's cheaper with an implicit sort over one that would be cheaper without a sort but for which the sort is expensive).
Query plans can also change from release to release, which could change the order of the results absent an explicit sort. The real signal to look for is a change in query cost and query performance.
I agree, Stephane.
I added an explicit test that takes a large number of records (> a single page) and makes sure records are received in order.
Because the TID Range Scan for newer Postgres versions was created for the purpose of an optimized scan, I'd prefer not to add an expensive sort; there can be millions and millions of records in each chunk we read.
@rodireich but if PG 14+ is already sorting when issuing a WHERE ctid > '0,0' AND ctid <= '(131000,0)' scan, I agree with @stephane-airbyte that we should add an explicit ORDER BY here.
There are two scenarios:
- It's already ordering by ctid, in which case the query optimizer should ignore this.
- It isn't already doing this, and the ORDER BY adds significant latency.
If the ORDER BY for PG 14+ adds significant latency, we can always:
- Tune chunk size so that the ORDER BY occurs in memory.
- Checkpoint at the end of every chunk and keep track of the largest ctid entry (and avoid the ORDER BY query entirely).
I'm worried about a similar case in PG 14+ where, in some cases, the TID range scan does not return records in order.
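The second alternative mentioned above could be sketched as follows. This is a hedged illustration, not Airbyte code: the `CtidTracker` class, the nested `Ctid` record, and the method names are hypothetical. The idea is that checkpointing the largest ctid seen per chunk makes correctness independent of the order in which rows arrive:

```java
// Hypothetical sketch: track the largest ctid observed in a chunk so the
// checkpoint value is correct even if the server returns rows out of order.
public class CtidTracker {

  record Ctid(long page, long tuple) implements Comparable<Ctid> {
    @Override
    public int compareTo(final Ctid other) {
      final int byPage = Long.compare(page, other.page);
      return byPage != 0 ? byPage : Long.compare(tuple, other.tuple);
    }

    // Parses Postgres ctid text such as "(131000,7)".
    static Ctid parse(final String text) {
      final String[] parts = text.substring(1, text.length() - 1).split(",");
      return new Ctid(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
    }
  }

  private Ctid max = new Ctid(0, 0);

  // Called once per row, in whatever order the server returns rows.
  void observe(final String ctidText) {
    final Ctid ctid = Ctid.parse(ctidText);
    if (ctid.compareTo(max) > 0) {
      max = ctid;
    }
  }

  // Emitted at the end of the chunk as the checkpoint value.
  Ctid checkpoint() {
    return max;
  }
}
```

Because only the running maximum is kept, the chunk never needs a server-side or in-memory sort.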
source-postgres-strict-encrypt test report (commit

Step | Result
---|---
Build connector tar | ✅
Build source-postgres-strict-encrypt docker image for platform(s) linux/x86_64 | ✅
Java Connector Unit Tests | ✅
Java Connector Integration Tests | ✅
Acceptance tests | ✅
Validate metadata for source-postgres-strict-encrypt | ✅
Connector version semver check | ✅
QA checks | ✅

View the logs here
View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:
airbyte-ci connectors --name=source-postgres-strict-encrypt test
…r … (airbytehq#31328) Co-authored-by: rodireich <[email protected]>
What
The order in which records are returned during the ctid initial load on Postgres 12 and 13 is undefined.
As a result, records returned out of order may lead to a loss of records if an error triggers another attempt.
What can happen is that instead of reading records in the order
(0,1); (0,2); (0,3)…
we read them in the order
(0,1); (1,1); (2,1)…
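The two orders above can be reproduced with a small standalone sketch (illustrative Java, not Airbyte code): enumerating the same (page,tuple) cross product with either loop innermost, then sorting by (page, tuple), which is what the ORDER BY in the fixed query achieves server-side.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CtidOrderDemo {

  // Expected order: page varies slowest, i.e. (0,1), (0,2), ..., (1,1), ...
  static List<String> tupleInner(int pages, int tuples) {
    List<String> out = new ArrayList<>();
    for (int p = 0; p < pages; p++)
      for (int t = 1; t <= tuples; t++)
        out.add("(" + p + "," + t + ")");
    return out;
  }

  // Order seen on PG 12/13: the bigger range (pages) iterates first,
  // i.e. (0,1), (1,1), (2,1), ...
  static List<String> pageInner(int pages, int tuples) {
    List<String> out = new ArrayList<>();
    for (int t = 1; t <= tuples; t++)
      for (int p = 0; p < pages; p++)
        out.add("(" + p + "," + t + ")");
    return out;
  }

  // Sorting by (page, tuple) restores the defined order.
  static List<String> sorted(List<String> ctids) {
    List<String> out = new ArrayList<>(ctids);
    out.sort(Comparator
        .comparingInt((String s) -> Integer.parseInt(s.substring(1, s.indexOf(','))))
        .thenComparingInt(s -> Integer.parseInt(s.substring(s.indexOf(',') + 1, s.length() - 1))));
    return out;
  }
}
```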
How
Update the query used for the legacy ctid load so that it defines an order of records.