feat(data-warehouse): Added incremental syncs for postgres #23145

Gilbert09 · 2024-06-21T12:54:11Z

Firstly, sorry for the huge PR - this kinda snowballed a fair amount

Problem

We want to allow incremental syncs from SQL databases (e.g. postgres)

Changes

Syncing UI has changed
You now have to "set up" a tables sync type before you can enable syncs, this applies to both existing and new schemas
Setting up a table requires you to select either "full refresh" or "incremental" with an incremental field chosen
We can only use date, timestamp, and int fields as the incremental field
Changing the sync type or the incremental field on an existing table will require the table to be fully refreshed (a warning pops up to say this)
Have restricted what fields are updateable on ExternalDataSchema and added logic there for triggering refreshes etc
Added a migration for adding sync_type_payload onto the schema object. We'll be storing incremental_field and incremental_field_type here.
- Also runs a SQL command to update existing incremental zendesk and stripe tables to pre-populate this column
When triggering a pipeline reset, we add a reset_pipeline value to the source's job_inputs which then gets reset at the end of the sync. This is to tell DLT to do a full reset of the table
- Updated the DLT package to use an alpha build of the new version, this gives us the ability to DLT to do a clean new sync (docs)

Screen.Recording.2024-06-21.at.13.56.26.mov

Screen.Recording.2024-06-21.at.13.57.29.mov

Does this work well for both Cloud and self-hosted?

Yup

How did you test this code?

Tested extensively locally
Added unit tests too

Gilbert09 · 2024-06-21T12:55:31Z

posthog/warehouse/types.py

+class IncrementalField(TypedDict):
+    label: str  # Label shown in the UI
+    type: IncrementalFieldType  # Field type shown in the UI
+    field: str  # Actual DB field accessed
+    field_type: IncrementalFieldType  # Actual DB type of the field


We have the extra fields for sources where the incremental field is actually masked by an ast.ExpressionField - e.g. in Stripe tables, we expose a created_at datetime, but the underlying data and Stripe API is actually created and integer.

github-actions · 2024-06-21T13:04:45Z

Size Change: +197 B (+0.02%)

Total Size: 1.06 MB

ℹ️ View Unchanged

Filename	Size	Change
`frontend/dist/toolbar.js`	1.06 MB	+197 B (+0.02%)

_{compressed-size-action}

EDsCODE

Super excited for this! Been a long time coming for properly supported incremental

Added some comments that are mostly NIT and i saw some of your comments that noted necessary cleanup. Otherwise ready to try it out

posthog/warehouse/models/external_data_schema.py

posthog/warehouse/api/test/conftest.py

EDsCODE · 2024-06-24T01:29:07Z

posthog/warehouse/api/external_data_source.py

            except Exception as e:
-                logger.exception(f"Could not clean up data import folder: {job.folder_path}", exc_info=e)
+                logger.exception(f"Could not clean up data import folder: {job.folder_path()}", exc_info=e)


What was the reason for shifting these from property to function?

It needed to be wrapped in a sync_to_async call in a test because calling the property was failing from async issues - couldn't figure out how to do the wrapping as a property, and so just changed it to a func instead

posthog/temporal/data_imports/pipelines/pipeline.py

EDsCODE · 2024-06-24T02:10:44Z

posthog/warehouse/api/external_data_schema.py

+                for name, field_type in filter_postgres_incremental_fields(columns)
+            ]
+        elif source.source_type == ExternalDataSource.Type.SNOWFLAKE:
+            # TODO(@Gilbert09): Move all this into a util and replace elsewhere


sentry-io · 2024-06-24T12:05:33Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ PipelineStepFailed: Pipeline execution failed at stage normalize when processing package 1720070374.407156 with excep... posthog.temporal.data_imports.pipelines.pipelin... View Issue
‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720058292.2765305 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue
‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720027275.764863 with exception: posthog.temporal.data_imports.pipelines.pipelin... View Issue
‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720064913.2371342 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue
‼️ PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1720043510.6308787 with except... posthog.temporal.data_imports.pipelines.pipelin... View Issue

_{Did you find this useful? React with a 👍 or 👎}

* Added incremental syncs for postgres * Tests * Fixed mypy * Added extra protection on the sync form * Set the correct write disposition for SQL * PR fixes * Fixed tests * Fixed tests * Fixed tests

Gilbert09 added 4 commits June 21, 2024 10:10

Added incremental syncs for postgres

3c18d54

Merge branch 'master' into tom/incremental-postgres

4e16758

Tests

2d9264b

Fixed mypy

bd001a7

Gilbert09 requested a review from EDsCODE June 21, 2024 12:54

Gilbert09 commented Jun 21, 2024

View reviewed changes

Gilbert09 added 2 commits June 21, 2024 14:53

Added extra protection on the sync form

c90dc77

Set the correct write disposition for SQL

e1cec02

EDsCODE approved these changes Jun 24, 2024

View reviewed changes

Gilbert09 and others added 5 commits June 24, 2024 11:07

PR fixes

0c4059b

Merge branch 'master' into tom/incremental-postgres

c7687e8

Fixed tests

b9d5841

Fixed tests

e9fe396

Fixed tests

fa6df18

Gilbert09 merged commit 56f8de7 into master Jun 24, 2024
92 checks passed

Gilbert09 deleted the tom/incremental-postgres branch June 24, 2024 11:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(data-warehouse): Added incremental syncs for postgres #23145

feat(data-warehouse): Added incremental syncs for postgres #23145

Gilbert09 commented Jun 21, 2024 •

edited

Loading

Gilbert09 Jun 21, 2024

github-actions bot commented Jun 21, 2024 •

edited

Loading

EDsCODE left a comment

EDsCODE Jun 24, 2024

Gilbert09 Jun 24, 2024

EDsCODE Jun 24, 2024

sentry-io bot commented Jun 24, 2024 •

edited

Loading

feat(data-warehouse): Added incremental syncs for postgres #23145

feat(data-warehouse): Added incremental syncs for postgres #23145

Conversation

Gilbert09 commented Jun 21, 2024 • edited Loading

Problem

Changes

Does this work well for both Cloud and self-hosted?

How did you test this code?

Gilbert09 Jun 21, 2024

Choose a reason for hiding this comment

github-actions bot commented Jun 21, 2024 • edited Loading

EDsCODE left a comment

Choose a reason for hiding this comment

EDsCODE Jun 24, 2024

Choose a reason for hiding this comment

Gilbert09 Jun 24, 2024

Choose a reason for hiding this comment

EDsCODE Jun 24, 2024

Choose a reason for hiding this comment

sentry-io bot commented Jun 24, 2024 • edited Loading

Suspect Issues

Gilbert09 commented Jun 21, 2024 •

edited

Loading

github-actions bot commented Jun 21, 2024 •

edited

Loading

sentry-io bot commented Jun 24, 2024 •

edited

Loading