🐛 Update initial load query for old postgres to return a defined order … #31328

rodireich · 2023-10-12T03:59:19Z

What

The order in which records are returned by the ctid initial load query on Postgres 12 and 13 is undefined.
As a result, if an error triggers another attempt, records that were returned out of order can be lost.

For example, instead of reading records in the order (0,1); (0,2); (0,3)…
we may read them in the order (0,1); (1,1); (2,1)…
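
To make the failure mode concrete, here is a small simulation. It is not the connector's code. It assumes (this is not spelled out in the PR) that a retry resumes strictly after the last ctid emitted before the failure, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CtidRetryDemo {

  // A ctid is modeled as {page, tuple}; compare by page first, then by tuple.
  static final Comparator<long[]> CTID_ORDER =
      Comparator.<long[]>comparingLong(c -> c[0]).thenComparingLong(c -> c[1]);

  /**
   * Simulates a first attempt that fails after failAfter records, followed by a
   * retry that only reads ctids greater than the last one emitted (assumed
   * checkpoint rule). Returns the ctids that neither attempt delivered.
   */
  static List<String> lostRecords(List<long[]> readOrder, int failAfter) {
    List<long[]> delivered = new ArrayList<>(readOrder.subList(0, failAfter));
    long[] checkpoint = readOrder.get(failAfter - 1);
    for (long[] ctid : readOrder) {
      if (CTID_ORDER.compare(ctid, checkpoint) > 0) {
        delivered.add(ctid); // the retry skips everything at or below the checkpoint
      }
    }
    List<String> lost = new ArrayList<>();
    for (long[] ctid : readOrder) {
      boolean seen = delivered.stream().anyMatch(d -> CTID_ORDER.compare(d, ctid) == 0);
      if (!seen) {
        lost.add("(" + ctid[0] + "," + ctid[1] + ")");
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    List<long[]> unordered = List.of(
        new long[] {0, 1}, new long[] {1, 1}, new long[] {0, 2}, new long[] {1, 2});
    List<long[]> ordered = List.of(
        new long[] {0, 1}, new long[] {0, 2}, new long[] {1, 1}, new long[] {1, 2});
    System.out.println(lostRecords(unordered, 2)); // [(0,2)]
    System.out.println(lostRecords(ordered, 2));   // []
  }
}
```

In the unordered run the checkpoint jumps to (1,1) before (0,2) has been read, so the retry never asks for (0,2).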

How

Update the query used for the legacy ctid load so that records are returned in a defined order.
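
As a rough sketch of what the updated query looks like once it is formatted, the snippet below reuses the template string from the "+" line of the InitialSyncCtidIterator.java diff in this PR. The class name, method name, and the sample column and table names are hypothetical, and binding the four `?` parameters (page range, then tuple range) is left out.

```java
final class LegacyCtidQuery {

  // Template copied from the "+" line of the diff. "%%s" is an escaped "%s" that
  // survives String.format so that Postgres' FORMAT() receives "(%s,%s)".
  static final String TEMPLATE =
      "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY ("
          + "SELECT FORMAT('(%%s,%%s)', page, tuple)::tid tid_addr "
          + "FROM generate_series(?, ?) as page, generate_series(?,?) as tuple "
          + "ORDER BY tid_addr))";

  // Both arguments are expected to be already quoted identifiers.
  static String build(String quotedColumns, String quotedTable) {
    return String.format(TEMPLATE, quotedColumns, quotedTable);
  }

  public static void main(String[] args) {
    // Hypothetical table and columns, for illustration only.
    System.out.println(build("\"id\",\"name\"", "\"public\".\"users\""));
  }
}
```

Note that the ORDER BY sorts the generated array of tids; the outer SELECT itself still has no ORDER BY clause.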

github-actions · 2023-10-12T04:22:19Z

Coverage report for source-postgres

File	Coverage [90.14%]	🍏
InitialSyncCtidIterator.java	90.14%	🍏

Total Project Coverage	71.71%	🍏

airbyte-oss-build-runner · 2023-10-12T04:43:00Z

source-postgres test report (commit `054f941147`) - ❌

⏲️ Total pipeline duration: 28mn03s

Step	Result
Build connector tar	✅
Build source-postgres docker image for platform(s) linux/x86_64	✅
Java Connector Unit Tests	✅
Java Connector Integration Tests	✅
Acceptance tests	✅
Validate metadata for source-postgres	✅
Connector version semver check	✅
Connector version increment check	❌
QA checks	✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-postgres test

airbyte-oss-build-runner · 2023-10-12T18:10:40Z

source-postgres test report (commit `bbfadb36be`) - ❌

⏲️ Total pipeline duration: 29mn11s

Step	Result
Build connector tar	✅
Build source-postgres docker image for platform(s) linux/x86_64	✅
Java Connector Unit Tests	✅
Java Connector Integration Tests	✅
Acceptance tests	✅
Validate metadata for source-postgres	✅
Connector version semver check	✅
Connector version increment check	❌
QA checks	✅

prateekmukhedkar · 2023-10-12T18:36:33Z

...gres/src/main/java/io/airbyte/integrations/source/postgres/ctid/InitialSyncCtidIterator.java

      LOGGER.info("Preparing query for table: {}", tableName);
      final String fullTableName = getFullyQualifiedTableNameWithQuoting(schemaName, tableName,
          quoteString);
      final String wrappedColumnNames = RelationalDbQueryUtils.enquoteIdentifierList(columnNames, quoteString);
      final String sql =
-          "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid FROM generate_series(?, ?) as page, generate_series(?,?) as tuple))"
+          "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid tid_addr FROM generate_series(?, ?) as page, generate_series(?,?) as tuple ORDER BY tid_addr))"


Do we need to check whether any other queries that return values we use in checkpointing also need an ORDER BY?

rodireich · 2023-10-12

The other ctid query is WHERE ctid > '0,0' AND ctid <= '(131000,0)'.
This > / <= range returns rows in sequential order - it was created as an optimization for old Postgres. So I don't think we need to sort there ourselves.

With the Postgres 12 query, Postgres applies some logic that makes it iterate over the bigger of the two ranges (pages) first. I couldn't find a way to control that other than sorting the array.

The other incremental queries are all or nothing: if an error happens in the middle of an incremental xmin sync, for example, it will not checkpoint at all. So even if there is some case where records come back out of order (I don't think this can happen anywhere), we will be fine.

stephane-airbyte · 2023-10-12

My general recommendation is to always add a sort when you want data sorted.

SQL query optimizers can easily remove the sorting step when it's not needed (or choose a query plan that is cheaper with an implicit sort over one that would be cheaper without a sort but for which the sort is expensive).
Query plans can also change from release to release, which could change the order of the results, absent an explicit sort. The real signal to look for is a change in query cost and query performance.

rodireich · 2023-10-12

I agree, Stephane.
I added an explicit test that reads a large number of records (more than a single page) and makes sure the records are received in order.
Because the TID Range Scan for newer Postgres versions was created with the purpose of making an optimized scan, I'd prefer not to add an expensive sort. There can be millions and millions of records in each chunk we read.
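
The kind of ordering assertion described above could look roughly like the following. This is a hypothetical helper, not the test added in the PR; it assumes ctids arrive in the `(page,tuple)` text form that `ctid::text` produces.

```java
import java.util.List;

final class CtidOrderCheck {

  // Parses "(page,tuple)" into {page, tuple}.
  static long[] parse(String ctid) {
    String[] parts = ctid.substring(1, ctid.length() - 1).split(",");
    return new long[] {Long.parseLong(parts[0].trim()), Long.parseLong(parts[1].trim())};
  }

  // True when every ctid is strictly greater than the one before it,
  // comparing the page number first and the tuple number second.
  static boolean isStrictlyAscending(List<String> ctids) {
    for (int i = 1; i < ctids.size(); i++) {
      long[] prev = parse(ctids.get(i - 1));
      long[] cur = parse(ctids.get(i));
      boolean ascending = cur[0] > prev[0] || (cur[0] == prev[0] && cur[1] > prev[1]);
      if (!ascending) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isStrictlyAscending(List.of("(0,1)", "(0,2)", "(1,1)"))); // true
    System.out.println(isStrictlyAscending(List.of("(0,1)", "(1,1)", "(0,2)"))); // false
    System.out.println(isStrictlyAscending(List.of("(2,1)", "(10,1)")));         // true
  }
}
```

The comparison is numeric on purpose: compared as plain text, "(10,1)" would sort before "(2,1)".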

akashkulk · 2023-10-12

@rodireich but if PG 14+ is already sorting when issuing a WHERE ctid > '0,0' AND ctid <= '(131000,0)' scan, I agree with @stephane-airbyte that we should add an explicit ORDER BY here.

There are 2 scenarios:

It's already ordering by ctid, in which case the query optimizer should ignore this.

It isn't already doing this and the ORDER BY adds significant latency.

If the ORDER BY for PG14+ adds significant latency, we can always:

Tune the chunk size so that the ORDER BY occurs in memory.

Checkpoint at the end of every chunk and keep track of the largest CTID entry (and avoid the ORDER BY entirely).

I'm worried about a similar case in PG14+ where the TID range scan does not return records in order.
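
The second fallback (checkpoint only at the end of a chunk and remember the largest ctid seen) might be sketched as below. The class and its methods are hypothetical and are not part of the connector.

```java
import java.util.Optional;

final class ChunkCheckpointTracker {

  private long maxPage = -1;
  private long maxTuple = -1;

  // Called once per record, in whatever order the database returns rows.
  void observe(long page, long tuple) {
    if (page > maxPage || (page == maxPage && tuple > maxTuple)) {
      maxPage = page;
      maxTuple = tuple;
    }
  }

  // Called only after the whole chunk has been read; empty if nothing was observed.
  Optional<String> checkpoint() {
    return maxPage < 0 ? Optional.empty() : Optional.of("(" + maxPage + "," + maxTuple + ")");
  }

  public static void main(String[] args) {
    ChunkCheckpointTracker tracker = new ChunkCheckpointTracker();
    long[][] outOfOrder = {{0, 1}, {1, 1}, {2, 1}, {0, 2}};
    for (long[] ctid : outOfOrder) {
      tracker.observe(ctid[0], ctid[1]);
    }
    System.out.println(tracker.checkpoint().orElse("none")); // (2,1)
  }
}
```

This is only safe if the checkpoint is emitted after the full chunk has been consumed; emitting it mid-chunk would bring back the same gap that out-of-order reads cause.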

airbyte-oss-build-runner · 2023-10-12T23:05:06Z

source-postgres-strict-encrypt test report (commit `f6c0f6267f`) - ✅

⏲️ Total pipeline duration: 10mn37s

Step	Result
Build connector tar	✅
Build source-postgres-strict-encrypt docker image for platform(s) linux/x86_64	✅
Java Connector Unit Tests	✅
Java Connector Integration Tests	✅
Acceptance tests	✅
Validate metadata for source-postgres-strict-encrypt	✅
Connector version semver check	✅
QA checks	✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-postgres-strict-encrypt test

airbyte-oss-build-runner · 2023-10-12T23:33:48Z

source-postgres test report (commit `f6c0f6267f`) - ✅

⏲️ Total pipeline duration: 28mn31s

Step	Result
Build connector tar	✅
Build source-postgres docker image for platform(s) linux/x86_64	✅
Java Connector Unit Tests	✅
Java Connector Integration Tests	✅
Acceptance tests	✅
Validate metadata for source-postgres	✅
Connector version semver check	✅
Connector version increment check	✅
QA checks	✅

Update initial load query for old postgres to return a defined order …

9b50a81

…of records

rodireich requested a review from a team as a code owner October 12, 2023 03:59

octavia-squidington-iii added area/connectors Connector related issues connectors/source/postgres labels Oct 12, 2023

Automated Commit - Formatting Changes

054f941

Merge branch 'master' into 3240-source-postgres-missing-data-from-cdc…

bbfadb3

…-snapshot

vercel bot deployed to Preview October 12, 2023 17:42 View deployment

prateekmukhedkar approved these changes Oct 12, 2023

Merge branch 'master' into 3240-source-postgres-missing-data-from-cdc…

05916aa

…-snapshot

vercel bot deployed to Preview October 12, 2023 22:49 View deployment

bump versions and changelog

f6c0f62

rodireich changed the title ~~Update initial load query for old postgres to return a defined order …~~ 🐛 Update initial load query for old postgres to return a defined order … Oct 12, 2023

vercel bot deployed to Preview October 12, 2023 22:53 View deployment

octavia-squidington-iii added connectors/source/postgres-strict-encrypt labels Oct 12, 2023

rodireich merged commit 12692ce into master Oct 12, 2023
19 of 23 checks passed

rodireich deleted the 3240-source-postgres-missing-data-from-cdc-snapshot branch October 12, 2023 23:40

ariesgun pushed a commit to ariesgun/airbyte that referenced this pull request Oct 23, 2023

🐛 Update initial load query for old postgres to return a defined orde…

825528c

…r … (airbytehq#31328) Co-authored-by: rodireich <[email protected]>
