Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix bigquery autodetect #2035

Merged
merged 6 commits into from
Nov 13, 2024
Merged

fix bigquery autodetect #2035

merged 6 commits into from
Nov 13, 2024

Conversation

sh-rp
Copy link
Collaborator

@sh-rp sh-rp commented Nov 7, 2024

Description

When using autodetect_schema on bigquery, merge and replace with "insert-from-staging" will fail, because dlt is:

  1. Trying to truncate non-existing tables (this is solveable fairly easy, see first commit)
  2. On the first load trying to insert data into a non existing table (also easily solvable by creating the correct final table from the staging table before the merge is run)

Problems: The suggested approach in this ticket will not work if the schema of the table changes, because bigquery has no commands to update the schema of an existing table to match the schema of another existing table. So the staging table will auto evolve but it's not easy to have the main table follow this without directly inserting data there.

Possible solutions:

  • Disallow schema evolutions when merge and autodetect is enabled. The users would have to manually do this.
  • Write a bit of code that can update the main table to match the staging table by reflecting the information schema. This probably is fairly time consuming to do on every load. We'd have to insert some update commands between loading the staging jobs which change the schema and executing the merge sql
  • Update the schemas of all tables, staging and final, by inserting data from a parquet file and selecting a partition that does not exist with auto schema updates enabled. This way, according to my research, the schema will be updated but no data is loaded.

@sh-rp sh-rp added the bug Something isn't working label Nov 7, 2024
@sh-rp sh-rp self-assigned this Nov 7, 2024
Copy link

netlify bot commented Nov 7, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit f515820
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67338091793bbe000833d88d

@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 11, 2024

The proposed quick fix for this would be to enable merging and staging replace for bigquery auto schema detection based on cloning the table schema from staging to final table in bigquery. Merges with schema updates will fail though, either we somehow freeze the schema for these tables or we add a very big note in the docs about this. Maybe we can amend the exception message for a fail like this with some more info about why this might have failed for bigquery tables with autoschema.

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes the case without schema evolution. it is a good first attempt (btw. I think autodetect schema cannot really evolve existing tables so we are fine!
https://www.reddit.com/r/bigquery/comments/1c02azc/autodetecting_updated_schema_when/?rdt=38122)

# NOTE: we also need to create all child tables
# NOTE: this will not work if the schema of the staging table changes in the next run..
# in some cases we need to create final tables here
sql.append(f"CREATE TABLE IF NOT EXISTS {root_table_name} LIKE {staging_root_table_name};")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code should be executed at the end of each copy job - and only if autodetect schema is enabled. this way we can just run our merge job as usual.

we could also migrate columns from staging table to destination tables but maybe we can start with just cloning

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have prepended it so the bigquery merge jobs for autodetect tables now. The other solution would be much more complicated with two consecutive followupjobs. I can do it that way too, but I think my solution is fairly clean...

@@ -273,6 +274,20 @@ def _make_database_exception(cls, ex: Exception) -> Exception:
# anything else is transient
return DatabaseTransientException(ex)

def truncate_tables(self, *tables: str) -> None:
"""NOTE: We only truncate tables that exist, for auto-detect schema we don't know which tables exist"""
statements: List[str] = ["DECLARE table_exists BOOL;"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do it only when autodetect schema is on?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@sh-rp sh-rp force-pushed the bug/1876-fix-bigquery-autodetect branch from 5ed912a to 92deee7 Compare November 12, 2024 14:28
@sh-rp sh-rp marked this pull request as ready for review November 12, 2024 15:45
@sh-rp sh-rp changed the title [wip] fix bigquery autodetect fix bigquery autodetect Nov 12, 2024
@sh-rp sh-rp requested a review from rudolfix November 12, 2024 16:04
rudolfix
rudolfix previously approved these changes Nov 12, 2024
@sh-rp sh-rp force-pushed the bug/1876-fix-bigquery-autodetect branch from 6e2c8cb to a546667 Compare November 12, 2024 16:19
@sh-rp sh-rp force-pushed the bug/1876-fix-bigquery-autodetect branch from a546667 to f515820 Compare November 12, 2024 16:21
@sh-rp sh-rp merged commit 1346b3b into devel Nov 13, 2024
58 of 59 checks passed
@sh-rp sh-rp deleted the bug/1876-fix-bigquery-autodetect branch November 13, 2024 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants