🐛 Update initial load query for old postgres to return a defined order … #31328

Merged
merged 5 commits into from
Oct 12, 2023

Conversation

rodireich
Contributor

@rodireich rodireich commented Oct 12, 2023

What

The order in which records are returned by the ctid initial load on Postgres 12 and 13 is undefined.
As a result, if an error triggers another attempt, records that were returned out of order may be lost.

For example, instead of reading records in the order (0,1); (0,2); (0,3)…
we may read them in the order (0,1); (1,1); (2,1)…
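The failure mode can be illustrated with a small simulation. This is only a sketch: it assumes the checkpoint is the last ctid emitted and that a retry resumes with rows whose ctid is strictly greater than that checkpoint. The class and method names are hypothetical and are not taken from the connector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CtidRetryLossDemo {
    // Compares two ctids given as {page, tuple}: page first, then tuple.
    static int compare(long[] a, long[] b) {
        int byPage = Long.compare(a[0], b[0]);
        return byPage != 0 ? byPage : Long.compare(a[1], b[1]);
    }

    // Emits the first failAfter rows (failAfter >= 1), "fails", then retries with
    // only the rows whose ctid is greater than the last ctid emitted before the failure.
    static List<long[]> syncWithOneFailure(List<long[]> rowsInReturnOrder, int failAfter) {
        List<long[]> emitted = new ArrayList<>();
        long[] checkpoint = null;
        for (int i = 0; i < failAfter; i++) {
            checkpoint = rowsInReturnOrder.get(i);
            emitted.add(checkpoint);
        }
        // Retry: resume strictly after the checkpoint (assumed resume rule).
        for (long[] row : rowsInReturnOrder) {
            if (compare(row, checkpoint) > 0) {
                emitted.add(row);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<long[]> ordered = Arrays.asList(
            new long[] {0, 1}, new long[] {0, 2}, new long[] {1, 1}, new long[] {1, 2});
        List<long[]> pageFirst = Arrays.asList(
            new long[] {0, 1}, new long[] {1, 1}, new long[] {0, 2}, new long[] {1, 2});
        // With ordered input every row is emitted; with page-first input (0,2) is never emitted.
        System.out.println(syncWithOneFailure(ordered, 2).size());
        System.out.println(syncWithOneFailure(pageFirst, 2).size());
    }
}
```

In the page-first case the checkpoint after two rows is (1,1), so the retry never revisits (0,2).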

How

Update the query used for the legacy ctid load so that it returns records in a defined order.
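To make the shape of the updated query concrete, here is a sketch of how the new template from this PR's diff would expand. It assumes the template is filled with `String.format` (the diff excerpt is truncated before that call), and the column and table names are made-up placeholders.

```java
class LegacyCtidQueryDemo {
    // The updated template from the diff: the generated tid is aliased as tid_addr
    // and the array subquery is ordered by it. "%%s" survives formatting as "%s",
    // which Postgres' FORMAT() then fills with page and tuple.
    static final String TEMPLATE =
        "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid tid_addr "
            + "FROM generate_series(?, ?) as page, generate_series(?,?) as tuple ORDER BY tid_addr))";

    // Assumption: the connector substitutes the quoted column list and table name this way.
    static String buildQuery(String wrappedColumnNames, String fullTableName) {
        return String.format(TEMPLATE, wrappedColumnNames, fullTableName);
    }

    public static void main(String[] args) {
        // Hypothetical identifiers, already quoted as the connector's helpers would do.
        System.out.println(buildQuery("\"id\", \"name\"", "\"public\".\"users\""));
    }
}
```

The four `?` placeholders remain in the result and would be bound later to the page range and the tuple range.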

@rodireich rodireich requested a review from a team as a code owner October 12, 2023 03:59
@vercel

vercel bot commented Oct 12, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 12, 2023 10:53pm

@github-actions
Contributor

github-actions bot commented Oct 12, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@github-actions
Contributor

github-actions bot commented Oct 12, 2023

Coverage report for source-postgres

File Coverage [90.14%] 🍏
InitialSyncCtidIterator.java 90.14% 🍏
Total Project Coverage 71.71% 🍏

@airbyte-oss-build-runner
Collaborator

source-postgres test report (commit 054f941147) - ❌

⏲️ Total pipeline duration: 28mn03s

Step Result
Build connector tar ✅
Build source-postgres docker image for platform(s) linux/x86_64 ✅
Java Connector Unit Tests ✅
Java Connector Integration Tests ✅
Acceptance tests ✅
Validate metadata for source-postgres ✅
Connector version semver check ✅
Connector version increment check ❌
QA checks ✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-postgres test

@airbyte-oss-build-runner
Collaborator

source-postgres test report (commit bbfadb36be) - ❌

⏲️ Total pipeline duration: 29mn11s

Step Result
Build connector tar ✅
Build source-postgres docker image for platform(s) linux/x86_64 ✅
Java Connector Unit Tests ✅
Java Connector Integration Tests ✅
Acceptance tests ✅
Validate metadata for source-postgres ✅
Connector version semver check ✅
Connector version increment check ❌
QA checks ✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud


  LOGGER.info("Preparing query for table: {}", tableName);
  final String fullTableName = getFullyQualifiedTableNameWithQuoting(schemaName, tableName,
      quoteString);
  final String wrappedColumnNames = RelationalDbQueryUtils.enquoteIdentifierList(columnNames, quoteString);
  final String sql =
-     "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid FROM generate_series(?, ?) as page, generate_series(?,?) as tuple))"
+     "SELECT ctid::text, %s FROM %s WHERE ctid = ANY (ARRAY (SELECT FORMAT('(%%s,%%s)', page, tuple)::tid tid_addr FROM generate_series(?, ?) as page, generate_series(?,?) as tuple ORDER BY tid_addr))"
Contributor

Do we need to check for any other queries that bring in values we use in checkpointing that also need order by?

Contributor Author

The other ctid query is WHERE ctid > '(0,0)' AND ctid <= '(131000,0)'.
This range comparison returns rows in sequential order (it was created as an optimization for old postgres), so I don't think we need to sort there ourselves.

With the Postgres 12 query, Postgres applies some logic that iterates over the bigger of the two ranges (pages) first. I couldn't find a way to control that other than sorting the array.

Other incremental queries are all or nothing: if an error happens in the middle of an incremental xmin sync, for example, it will not checkpoint at all. So even if records do arrive out of order in some case (I don't think this can happen anywhere), we will be fine.
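The effect of sorting the generated array can be modeled outside the database. The sketch below is not Postgres's actual execution logic. It simply enumerates a pages × tuples cross product with the page range varying fastest, as described above, and then applies the (page, tuple) ordering that `ORDER BY tid_addr` is meant to impose. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TidArrayOrderDemo {
    // Enumerates pages [0, pages) x tuples [1, tuples] with the page range varying fastest.
    static List<long[]> pagesVaryFastest(long pages, long tuples) {
        List<long[]> out = new ArrayList<>();
        for (long t = 1; t <= tuples; t++) {
            for (long p = 0; p < pages; p++) {
                out.add(new long[] {p, t});
            }
        }
        return out;
    }

    // Returns a copy sorted by page, then tuple (the natural tid order).
    static List<long[]> sorted(List<long[]> tids) {
        List<long[]> copy = new ArrayList<>(tids);
        copy.sort(Comparator.<long[]>comparingLong(t -> t[0]).thenComparingLong(t -> t[1]));
        return copy;
    }

    // Renders tids in the "(page,tuple); (page,tuple)" style used in the PR description.
    static String render(List<long[]> tids) {
        StringBuilder sb = new StringBuilder();
        for (long[] t : tids) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append('(').append(t[0]).append(',').append(t[1]).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<long[]> raw = pagesVaryFastest(3, 2);
        System.out.println(render(raw));         // (0,1); (1,1); (2,1); (0,2); (1,2); (2,2)
        System.out.println(render(sorted(raw))); // (0,1); (0,2); (1,1); (1,2); (2,1); (2,2)
    }
}
```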

Contributor

My general recommendation is to always add a sort when you want data sorted.

SQL query optimizers can easily remove the sorting step when it's not needed (or choose a query plan that's cheaper with an implicit sort than one that would be cheaper without a sort but for which the sort is expensive).
Query plans can also change from release to release, which could change the order of the results, absent an explicit sort. The real signal to look for is a change in query cost and query performance.

Contributor Author

I agree, Stephane.
I added an explicit test that reads a large number of records (more than a single page) and makes sure they are received in order.
Because the TID Range Scan in newer Postgres versions was created to provide an optimized scan, I'd prefer not to add an expensive sort. There can be millions and millions of records in each chunk we read.
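An ordering check of the kind such a test needs could look roughly like the following. This is a sketch, not the test added in this PR. It assumes the ctids arrive as text in the `(page,tuple)` form produced by `ctid::text`, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class CtidOrderCheck {
    // Parses "(page,tuple)" text into {page, tuple}.
    static long[] parse(String ctid) {
        String[] parts = ctid.substring(1, ctid.length() - 1).split(",");
        return new long[] {Long.parseLong(parts[0].trim()), Long.parseLong(parts[1].trim())};
    }

    // True when every ctid is strictly greater than the one before it,
    // comparing numerically by page and then by tuple.
    static boolean isStrictlyAscending(List<String> ctids) {
        long[] prev = null;
        for (String text : ctids) {
            long[] cur = parse(text);
            if (prev != null
                && (cur[0] < prev[0] || (cur[0] == prev[0] && cur[1] <= prev[1]))) {
                return false;
            }
            prev = cur;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStrictlyAscending(Arrays.asList("(0,1)", "(0,2)", "(1,1)")));
        System.out.println(isStrictlyAscending(Arrays.asList("(0,1)", "(1,1)", "(0,2)")));
    }
}
```

Comparing the parsed numbers rather than the raw strings matters: as text, "(10,1)" sorts before "(9,1)".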

Contributor

@rodireich but even if PG 14+ is already sorting when it runs a WHERE ctid > '(0,0)' AND ctid <= '(131000,0)' scan, I agree with @stephane-airbyte that we should add an explicit ORDER BY here.

There are 2 scenarios:

  1. It's already ordering by ctid, in which case the query optimizer should ignore the ORDER BY.
  2. It isn't already doing this, and the ORDER BY adds significant latency.

If the ORDER BY for PG 14+ adds significant latency, we can always:

  1. Tune the chunk size so that the ORDER BY occurs in memory.
  2. Checkpoint at the end of every chunk and keep track of the largest CTID entry (avoiding the ORDER BY entirely).

I'm worried about a similar case in PG 14+ where the TID range scan does not return records in order.
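The second fallback listed above (checkpoint once per chunk, using the largest ctid seen) might be sketched as follows. This is an illustration of the idea only, with hypothetical names. It is not code from the connector, and it assumes a chunk is read to completion before any checkpoint is written.

```java
class ChunkMaxCtidTracker {
    private long maxPage = -1;
    private long maxTuple = -1;

    // Records one row's ctid; arrival order within the chunk does not matter.
    void observe(long page, long tuple) {
        if (page > maxPage || (page == maxPage && tuple > maxTuple)) {
            maxPage = page;
            maxTuple = tuple;
        }
    }

    // The value to checkpoint once the whole chunk has been read,
    // rendered in the "(page,tuple)" text form of a tid.
    String checkpoint() {
        return "(" + maxPage + "," + maxTuple + ")";
    }

    public static void main(String[] args) {
        ChunkMaxCtidTracker tracker = new ChunkMaxCtidTracker();
        // Rows arriving out of order within one chunk.
        tracker.observe(0, 1);
        tracker.observe(2, 1);
        tracker.observe(0, 2);
        tracker.observe(1, 5);
        System.out.println(tracker.checkpoint()); // (2,1)
    }
}
```

Because the checkpoint is only taken after the chunk completes, an error mid-chunk would re-read the whole chunk instead of resuming from a partial position.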

@rodireich rodireich changed the title Update initial load query for old postgres to return a defined order … 🐛 Update initial load query for old postgres to return a defined order … Oct 12, 2023
@airbyte-oss-build-runner
Collaborator

source-postgres-strict-encrypt test report (commit f6c0f6267f) - ✅

⏲️ Total pipeline duration: 10mn37s

Step Result
Build connector tar ✅
Build source-postgres-strict-encrypt docker image for platform(s) linux/x86_64 ✅
Java Connector Unit Tests ✅
Java Connector Integration Tests ✅
Acceptance tests ✅
Validate metadata for source-postgres-strict-encrypt ✅
Connector version semver check ✅
QA checks ✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-postgres-strict-encrypt test

@airbyte-oss-build-runner
Collaborator

source-postgres test report (commit f6c0f6267f) - ✅

⏲️ Total pipeline duration: 28mn31s

Step Result
Build connector tar ✅
Build source-postgres docker image for platform(s) linux/x86_64 ✅
Java Connector Unit Tests ✅
Java Connector Integration Tests ✅
Acceptance tests ✅
Validate metadata for source-postgres ✅
Connector version semver check ✅
Connector version increment check ✅
QA checks ✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud


@rodireich rodireich merged commit 12692ce into master Oct 12, 2023
19 of 23 checks passed
@rodireich rodireich deleted the 3240-source-postgres-missing-data-from-cdc-snapshot branch October 12, 2023 23:40
ariesgun pushed a commit to ariesgun/airbyte that referenced this pull request Oct 23, 2023
6 participants