Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(data-warehouse): Added incremental syncs for postgres #23145

Merged
merged 11 commits into from
Jun 24, 2024

Conversation

Gilbert09
Copy link
Member

@Gilbert09 Gilbert09 commented Jun 21, 2024

Firstly, sorry for the huge PR - this kinda snowballed a fair amount

Problem

  • We want to allow incremental syncs from SQL databases (e.g. postgres)

Changes

  • Syncing UI has changed
  • You now have to "set up" a tables sync type before you can enable syncs, this applies to both existing and new schemas
  • Setting up a table requires you to select either "full refresh" or "incremental" with an incremental field chosen
  • We can only use date, timestamp, and int fields as the incremental field
  • Changing the sync type or the incremental field on an existing table will require the table to be fully refreshed (a warning pops up to say this)
  • Have restricted what fields are updateable on ExternalDataSchema and added logic there for triggering refreshes etc
  • Added a migration for adding sync_type_payload onto the schema object. We'll be storing incremental_field and incremental_field_type here.
    • Also runs a SQL command to update existing incremental zendesk and stripe tables to pre-populate this column
  • When triggering a pipeline reset, we add a reset_pipeline value to the source's job_inputs which then gets reset at the end of the sync. This is to tell DLT to do a full reset of the table
    • Updated the DLT package to use an alpha build of the new version, this gives us the ability to DLT to do a clean new sync (docs)
Screen.Recording.2024-06-21.at.13.56.26.mov
Screen.Recording.2024-06-21.at.13.57.29.mov

Does this work well for both Cloud and self-hosted?

Yup

How did you test this code?

  • Tested extensively locally
  • Added unit tests too

@Gilbert09 Gilbert09 requested a review from EDsCODE June 21, 2024 12:54
Comment on lines +13 to +17
class IncrementalField(TypedDict):
label: str # Label shown in the UI
type: IncrementalFieldType # Field type shown in the UI
field: str # Actual DB field accessed
field_type: IncrementalFieldType # Actual DB type of the field
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the extra fields for sources where the incremental field is actually masked by an ast.ExpressionField - e.g. in Stripe tables, we expose a created_at datetime, but the underlying data and Stripe API is actually created and integer.

Copy link
Contributor

github-actions bot commented Jun 21, 2024

Size Change: +197 B (+0.02%)

Total Size: 1.06 MB

ℹ️ View Unchanged
Filename Size Change
frontend/dist/toolbar.js 1.06 MB +197 B (+0.02%)

compressed-size-action

Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super excited for this! Been a long time coming for properly supported incremental

Added some comments that are mostly NIT and i saw some of your comments that noted necessary cleanup. Otherwise ready to try it out

posthog/warehouse/models/external_data_schema.py Outdated Show resolved Hide resolved
posthog/warehouse/api/test/conftest.py Outdated Show resolved Hide resolved
except Exception as e:
logger.exception(f"Could not clean up data import folder: {job.folder_path}", exc_info=e)
logger.exception(f"Could not clean up data import folder: {job.folder_path()}", exc_info=e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for shifting these from property to function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needed to be wrapped in a sync_to_async call in a test because calling the property was failing from async issues - couldn't figure out how to do the wrapping as a property, and so just changed it to a func instead

posthog/temporal/data_imports/pipelines/pipeline.py Outdated Show resolved Hide resolved
for name, field_type in filter_postgres_incremental_fields(columns)
]
elif source.source_type == ExternalDataSource.Type.SNOWFLAKE:
# TODO(@Gilbert09): Move all this into a util and replace elsewhere
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Gilbert09 Gilbert09 merged commit 56f8de7 into master Jun 24, 2024
92 checks passed
@Gilbert09 Gilbert09 deleted the tom/incremental-postgres branch June 24, 2024 11:47
Copy link

sentry-io bot commented Jun 24, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ PipelineStepFailed: Pipeline execution failed at stage normalize when processing package 1720070374.407156 with excep... posthog.temporal.data_imports.pipelines.pipelin... View Issue
  • ‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720058292.2765305 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue
  • ‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720027275.764863 with exception: posthog.temporal.data_imports.pipelines.pipelin... View Issue
  • ‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720064913.2371342 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue
  • ‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720043510.6308787 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue

Did you find this useful? React with a 👍 or 👎

timgl pushed a commit that referenced this pull request Jun 27, 2024
* Added incremental syncs for postgres

* Tests

* Fixed mypy

* Added extra protection on the sync form

* Set the correct write disposition for SQL

* PR fixes

* Fixed tests

* Fixed tests

* Fixed tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants